Discussion:
Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
KHATWANI PARTH BHARAT
2017-03-31 15:40:50 UTC
Sir,
I am trying to write the k-means clustering algorithm using Mahout Samsara,
but I am a bit confused about how to leverage the Distributed Row Matrix
(DRM) for it. Can anybody help me with this?





Thanks
Parth Khatwani
Dmitriy Lyubimov
2017-03-31 16:40:34 UTC
Was my reply to your post on @user a bit confusing?

KHATWANI PARTH BHARAT
2017-03-31 16:45:12 UTC
Yes, I am unable to figure out the way ahead, e.g., how to create the
augmented matrix A := (0|D) which you mentioned.
KHATWANI PARTH BHARAT
2017-03-31 16:54:22 UTC
@Dmitriy, can you please explain the approach again so I can move ahead?


Thanks
Parth Khatwani


Dmitriy Lyubimov
2017-03-31 17:53:05 UTC
Here is the outline. For details of the APIs, please refer to the Samsara
manual [2]; I will not repeat it here.

Assume your training data input is m x n matrix A. For simplicity let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

Initialization:

First, classic k-means starts by selecting initial clusters by sampling
them out. You can do that using the sampling API [1], thus forming a k x n
in-memory matrix C (the current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignment and recomputing
the centroid matrix C until convergence, based on some test or simply
limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row of A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices
to the keys of A using the operator mapBlock() (details in [2], [3], [4]).
You also need to broadcast C in order to be able to access it efficiently
inside the mapBlock() closure. Examples of that are plentiful in [2].
Essentially, in mapBlock, you'd rewrite the row keys to reflect the cluster
index in C. While going over A, you'd have a "nearest neighbor" problem to
solve for each row of A against the centroids C. This is the bulk of the
computation, really, and there are a few tricks that can speed this step up
in both exact and approximate ways, but you can start with a naive search.
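The naive search mentioned above can be sketched in plain, self-contained Scala (using bare arrays instead of Mahout's Vector/Matrix types so it stands alone; the names sqDist and findClosest are illustrative, not Mahout API):

```scala
object NearestCentroid {
  // squared Euclidean distance between a point and a centroid row
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // index of the centroid closest to `point` -- the per-row work
  // that would run inside the mapBlock() closure
  def findClosest(point: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => sqDist(point, centroids(i)))
}
```

For example, with centroids (0,0) and (10,10), the point (1,2) maps to index 0; squared distance is enough for the argmin, so the sqrt can be skipped.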

Centroid recomputation:
once you have assigned centroid indices to the keys of matrix A, you'd want
to do an aggregating transpose of A to compute, essentially, the average of
the rows of A grouped by centroid key. The trick is to compute (1|A)',
which results in a matrix of the shape (counts | sums of cluster rows).
This is the part I find difficult to explain without LaTeX graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression

(1 cbind A).t (again, see [2]).

So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element is the number of points
in the cluster, and the rest of it corresponds to the sum of all points. So
in order to arrive at an updated matrix C, we need to collect B into
memory, and slice out the counters (the first row) from the rest of it.

So, to compute C:

C <- B(2:,:), each row divided elementwise by B(1,:)

(watch out for empty clusters with 0 elements; they will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses sub-blocking and row-wise iteration over B,
for which I am again referring you to [2].
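The arithmetic behind the (1|A)' aggregation and the division above can be mimicked in plain Scala to see what the aggregating transpose buys: accumulate (count, sum) per cluster key, then divide each sum by its count. This is only a sketch of the math, not the distributed operator; the name recompute and the zero-vector fallback for empty clusters are this sketch's choices:

```scala
object CentroidUpdate {
  // points: rows of A; assign(i): cluster key of row i; k: number of clusters.
  // Returns the updated k x n centroid matrix C.
  def recompute(points: Array[Array[Double]], assign: Array[Int], k: Int): Array[Array[Double]] = {
    val n = points(0).length
    val counts = new Array[Double](k)              // first row of B = (1|A)'
    val sums = Array.fill(k)(new Array[Double](n)) // remaining rows of B, one column per cluster
    for (i <- points.indices) {
      val c = assign(i)
      counts(c) += 1.0
      for (j <- 0 until n) sums(c)(j) += points(i)(j)
    }
    // C <- sums / counts; an empty cluster yields a zero vector here instead of NaNs
    Array.tabulate(k) { c =>
      if (counts(c) == 0.0) new Array[Double](n)
      else sums(c).map(_ / counts(c))
    }
  }
}
```

The empty-cluster guard corresponds to the NaN warning above; another common choice is to keep the old centroid or re-seed it from a random point.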


[1]
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149

[2] Samsara manual, a bit dated but viable,
http://apache.github.io/mahout/doc/ScalaSparkBindings.html

[3] scaladoc, again, dated but largely viable for the purpose of this
exercise:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm

[4] mapblock etc.
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps

Dmitriy Lyubimov
2017-03-31 18:04:55 UTC
PS1: this assumes row-wise construction of A based on a training set of m
n-dimensional points.
PS2: since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint API), if
Spark is used.
KHATWANI PARTH BHARAT
2017-04-12 17:29:48 UTC
@Dmitriy Sir

I have completed the k-means code as per the algorithm you outlined above.

My code is as follows.

This code works fine till step 10.

In step 11 I am assigning the new centroid index to the corresponding row
key of each data point in the matrix. I think I am doing something wrong in
step 11; maybe I am using incorrect syntax.

Can you help me find out what I am doing wrong?


//start of main method

def main(args: Array[String]) {
  //1. initialize the Spark and Mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in an RDD
  val lines = sc.textFile(args(1))

  //3. convert the data read in as strings into arrays of Double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 to each array of Double; this creates
  //   something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert the RDD of Array[Double] into an RDD of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert the RDD to a DrmRdd
  val rddMatrixLike: DrmRdd[Int] =
    rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert the DrmRdd to a CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. separate out the column of all ones created in step 4; we will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. sampling to select the initial centroids
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. broadcasting the initial centroids
  val broadCastMatrix = drmBroadcast(centriods)

  //11. iterating over the data matrix (in DrmLike[Int] format) to assign centroids
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        //12. findTheClosestCentriod finds the closest centroid to the
        //    data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        //13. assigning the closest index to the key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. calculating (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. aggregating transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15, bTranspose will have data in the following format:
  /* (n+1) x K, where n = dimension of the data points, K = number of clusters;
   * the zeroth row will contain the count of points assigned to each cluster
   * (assuming 3-d data points) */

  val nrows = b.nrow.toInt

  //16. slicing the count vector out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data points by the count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. separating out the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till the convergence criterion is met
} //end of main method


// method to find the closest centroid to a data point (vec: Vector in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distances between the points (vectors)
def ssr(a: Vector, b: Vector): Double = {
  (a - b) ^= 2 sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}


Thanks & Regards
Parth Khatwani



Dmitriy Lyubimov
2017-04-12 18:25:06 UTC
I can't say I can read this code well, formatted that way...

It would seem to me that the code is not using the broadcast variable, and
is instead using the closure variable. That's the only thing I can
immediately see by looking at the middle of it.

It would be better if you created a branch on GitHub for that code; that
would allow for easy check-outs and comments.

-d
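The distinction being pointed at here, in rough form: the code inside the mapBlock() closure should dereference the handle returned by drmBroadcast (via .value) rather than capture the driver-side matrix directly. A plain-Scala mock of that pattern (BcastHandle and closestViaBroadcast are stand-ins for illustration; this is NOT the actual Mahout/Spark broadcast API):

```scala
// Stand-in for a broadcast handle: workers call .value to get the shipped copy.
class BcastHandle[T](payload: T) {
  def value: T = payload
}

object BroadcastPattern {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // The correct shape: closure-side code receives only the small handle
  // and dereferences it with .value on the worker, instead of capturing
  // the centroid matrix itself (as the posted step 12 does).
  def closestViaBroadcast(point: Array[Double],
                          bcC: BcastHandle[Array[Array[Double]]]): Int = {
    val centroids = bcC.value // dereference inside the "closure"
    centroids.indices.minBy(i => sqDist(point, centroids(i)))
  }
}
```

In the posted code this would mean passing broadCastMatrix (not centriods) into the mapBlock() closure and reading its value there.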

2017-04-12 18:50:10 UTC
OK, I will do that.
Post by Dmitriy Lyubimov
can't say i can read this code well formatted that way...
it would seem to me that the code is not using the broadcast variable and
instead is using closure variable. that's the only thing i can immediately
see by looking in the middle of it.
it would be better if you created a branch on github for that code that
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm you have Outline above
My code is as follows
This code works fine till step number 10
In step 11 i am assigning the new centriod index to corresponding row
key
Post by KHATWANI PARTH BHARAT
of data Point in the matrix
I think i am doing something wrong in step 11 may be i am using incorrect
syntax
Can you help me find out what am i doing wrong.
//start of main method
def main(args: Array[String]) {
//1. initialize the spark and mahout context
val conf = new SparkConf()
.setAppName("DRMExample")
.setMaster(args(0))
.set("spark.serializer", "org.apache.spark.serializer.
KryoSerializer")
.set("spark.kryo.registrator",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
implicit val sc = new SparkDistributedContext(new SparkContext(conf))
//2. read the data file and save it in the rdd
val lines = sc.textFile(args(1))
//3. convert data read in as string in to array of double
val test = lines.map(line => line.split('\t').map(_.toDouble))
//4. add a column having value 1 in array of double this will
create something like (1 | D)', which will be used while calculating
(1 | D)'
val augumentedArray = test.map(addCentriodColumn _)
    //5. convert rdd of array of double into rdd of DenseVector
    val rdd = augumentedArray.map(dvec(_))

    //6. convert rdd to DrmRdd
    val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

    //7. convert DrmRdd to CheckpointedDrm[Int]
    val matrix = drmWrap(rddMatrixLike)

    //8. separate the column of all ones created in step 4; it will be used later
    val oneVector = matrix(::, 0 until 1)

    //9. final input data in DrmLike[Int] format
    val dataDrmX = matrix(::, 1 until 4)

    //9. sampling to select initial centroids
    val centriods = drmSampleKRows(dataDrmX, 2, false)
    centriods.size

    //10. broadcasting the initial centroids
    val broadCastMatrix = drmBroadcast(centriods)

    //11. iterating over the data matrix (in DrmLike[Int] format) to assign centroids
    dataDrmX.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          var dataPoint = block(row, ::)

          //12. findTheClosestCentriod finds the centroid closest to the data point "dataPoint"
          val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

          //13. assigning the closest index to the key
          keys(row) = closesetIndex
        }
        keys -> block
    }

    //14. calculating (1|D)
    val b = (oneVector cbind dataDrmX)

    //15. aggregating transpose (1|D)'
    val bTranspose = (oneVector cbind dataDrmX).t
    // after step 15 bTranspose will have data in the following format:
    /* (n+1) x K, where n = dimension of the data point, K = number of clusters;
     * the zeroth row will contain the count of points assigned to each cluster
     * (assuming 3d data points)
     */

    val nrows = b.nrow.toInt

    //16. slicing the count vectors out
    val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
    val vectorSums = b(1 until nrows, ::)

    //17. dividing the data points by the count vector
    vectorSums.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          block(row, ::) /= pointCountVectors
        }
        keys -> block
    }

    //18. separating the count vectors
    val newCentriods = vectorSums.t(::, 1 until centriods.size)

    //19. iterate over the above code till the convergence criteria are met
  } //end of main method

  //method to find the closest centroid to a data point
  def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
    var index = 0
    var closest = Double.PositiveInfinity
    for (row <- 0 until matrix.nrow) {
      val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
      if (tempDist < closest) {
        closest = tempDist
        index = row
      }
    }
    index
  }

  //calculating the sum of squared distances between the points (vectors)
  def ssr(a: Vector, b: Vector): Double = {
    ((a - b) ^= 2).sum
  }

  //method used to create (1|D)
  def addCentriodColumn(arg: Array[Double]): Array[Double] = {
    val newArr = new Array[Double](arg.length + 1)
    newArr(0) = 1.0
    for (i <- 0 until arg.size) {
      newArr(i + 1) = arg(i)
    }
    newArr
  }
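[Editor's note] To sanity-check the nearest-centroid and (1|D) helpers outside of the Mahout/Spark stack, here is a minimal plain-Scala sketch of the same logic over bare arrays; the object name, helper names, and test data are hypothetical, not part of the posted code:

```scala
// Minimal, dependency-free analogue of the helpers above (hypothetical names).
object NearestCentroidSketch {

  // Sum of squared differences between two points (same role as ssr above).
  def ssr(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to `vec` (same role as findTheClosestCentriod).
  def closestIndex(vec: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => ssr(vec, centroids(i)))

  // Prepend a 1.0 column to a data row, forming a row of (1 | D)
  // (same role as addCentriodColumn).
  def addOnesColumn(row: Array[Double]): Array[Double] = 1.0 +: row

  def main(args: Array[String]): Unit = {
    val centroids = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    println(closestIndex(Array(1.0, 1.0), centroids))     // 0
    println(closestIndex(Array(9.0, 11.0), centroids))    // 1
    println(addOnesColumn(Array(2.0, 3.0)).mkString(",")) // 1.0,2.0,3.0
  }
}
```

Note that comparing squared distances directly (without Math.sqrt) gives the same winner, since sqrt is monotonic.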
Thanks & Regards
Parth Khatwani
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A, it may make sense to make
sure it is committed to the spark cache (by using the checkpoint api), if
spark is used
Post by Dmitriy Lyubimov
here is the outline. For details of APIs, please refer to the samsara
manual [2]; i will not be repeating it.
Assume your training data input is an m x n matrix A. For simplicity
let's assume it's a DRM with int row keys, i.e., DrmLike[Int].
First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.
You then proceed by alternating between cluster assignments and
recomputing the centroid matrix C till convergence, based on some test or
simply limited by an epoch count budget, your choice.
Cluster assignments: here, we go over the current generation of A and
recompute centroid indexes for each row in A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices
to keys of A using the operator mapBlock() (details in [2], [3], [4]). You
also need to broadcast C in order to be able to access it in an efficient
manner inside the mapBlock() closure. Examples of that are plenty in [2].
Essentially, in mapBlock, you'd reform the row keys to reflect the cluster
index in C. While going over A, you'd have a "nearest neighbor" problem
to solve for the rows of A and centroids C. This is the bulk of the
computation really, and there are a few tricks there that can speed this
step up in both exact and approximate manner, but you can start with a
naive search.
Once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute essentially the average of rows
of A grouped by the centroid key. The trick is to do a computation of
(1|A)', which results in a matrix of the shape (counts / sums of cluster
rows). This is the part i find difficult to explain without latex
graphics.
In Samsara, construction of (1|A)' corresponds to the DRM expression
(1 cbind A).t (again, see [2]).
So when you compute, say,
B = (1 | A)',
then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element would be the # of points
in the cluster, and the rest of it would correspond to the sum of all
points. So in order to arrive at an updated matrix C, we need to collect
B into memory, and slice out the counters (first row) from the rest of it:
C <- B(2:,:) each row divided by B(1,:)
(watch out for empty clusters with 0 elements; this will cause lack of
convergence and NaNs in the newly computed C).
This operation obviously uses subblocking and row-wise iteration over B,
for which i am again making reference to [2].
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the purpose of this: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapblock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
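[Editor's note] As a small illustration of the (1|A)' trick described above, the following self-contained Scala sketch simulates the aggregating transpose on plain arrays: each row carries a cluster key, the leading 1-column sums to per-cluster counts, and dividing the remaining coordinate sums by the count yields the updated centroid. Object and function names are hypothetical, not Mahout API:

```scala
// Hypothetical stand-in for the aggregating transpose of (1 | A) on plain arrays.
object AggTransposeSketch {

  // Each input row is (clusterKey, point). Summing rows per key reproduces the
  // columns of B = (1 | A)': element 0 is the point count (the summed 1-column),
  // the rest are per-coordinate sums.
  def clusterStats(rows: Seq[(Int, Array[Double])], k: Int): Array[Array[Double]] = {
    val n = rows.head._2.length
    val b = Array.fill(k)(new Array[Double](n + 1))
    for ((key, point) <- rows) {
      b(key)(0) += 1.0 // the "1" column aggregates to a count
      for (j <- 0 until n) b(key)(j + 1) += point(j)
    }
    b
  }

  // Updated centroid: coordinate sums divided by the count, guarding
  // against empty clusters (the NaN hazard mentioned above).
  def newCentroids(b: Array[Array[Double]]): Array[Array[Double]] =
    b.map { col =>
      val count = col(0)
      col.drop(1).map(s => if (count > 0) s / count else 0.0)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq((0, Array(1.0, 2.0)), (0, Array(3.0, 4.0)), (1, Array(10.0, 10.0)))
    val b = clusterStats(rows, 2)
    println(b(0).mkString(","))               // 2.0,4.0,6.0  (count, then sums)
    println(newCentroids(b)(0).mkString(",")) // 2.0,3.0
  }
}
```

This mirrors the C <- B(2:,:) / B(1,:) step: in Samsara the grouping is done by the DRM transpose on reassigned row keys rather than by an explicit loop.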
KHATWANI PARTH BHARAT
2017-04-13 12:54:21 UTC
Permalink
Dmitriy Sir,
I have created a GitHub branch with the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>

Thanks & Regards
Parth Khatwani
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
can't say i can read this code well, formatted that way...
it would seem to me that the code is not using the broadcast variable and
instead is using the closure variable. that's the only thing i can
immediately see by looking in the middle of it.
it would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.
-d
KHATWANI PARTH BHARAT
2017-04-14 09:26:52 UTC
Permalink
@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly:

assigning the closest centroid index to the row keys of the DRM.
//11. Iterating over the Data Matrix(in DrmLike[Int] format) to calculate
the initial centriods
dataDrmX.mapBlock() {
case (keys, block) =>
for (row <- 0 until block.nrow) {
var dataPoint = block(row, ::)

//12. findTheClosestCentriod find the closest centriod to the
Data point specified by "dataPoint"
val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

//13. assigning closest index to key
keys(row) = closesetIndex
}
keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint.
In step 13 I am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint.
I think I am doing step 13 incorrectly.

Also, I am unable to find a proper reference for this in the reference
links you have mentioned above.

Thanks & Regards
Parth Khatwani
Dmitriy Lyubimov
2017-04-14 18:27:46 UTC
Permalink
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the
effect should be idempotent if a task is retried, which might be a problem
in the specific scenario of the object tree coming out of the block cache,
where it can stay and be retried again. but specifically w.r.t. this key
assignment i don't see any problem, since the action obviously would be
idempotent even if this code is run multiple times on the same
(key, block) pair. This part should be good IMO.
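[Editor's note] Dmitriy's idempotency argument can be illustrated with a tiny self-contained sketch (plain Scala, no Spark or Mahout; all names are hypothetical): because the new key depends only on the block contents and the centroids, not on the old key, running the assignment pass twice on the same (keys, block) pair yields identical keys, so a task retry is harmless.

```scala
// Hypothetical illustration: the key-assignment pass is idempotent.
object IdempotentAssignSketch {

  def ssr(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closest(vec: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => ssr(vec, centroids(i)))

  // One "mapBlock"-style pass: overwrite each row key with the index of the
  // nearest centroid; depends only on the block contents, not on the old keys.
  def assign(keys: Array[Int], block: Array[Array[Double]],
             centroids: Array[Array[Double]]): Array[Int] = {
    for (r <- block.indices) keys(r) = closest(block(r), centroids)
    keys
  }

  def main(args: Array[String]): Unit = {
    val centroids = Array(Array(0.0), Array(10.0))
    val block = Array(Array(1.0), Array(9.0))
    val once  = assign(Array(0, 0), block, centroids)
    val twice = assign(once.clone(), block, centroids) // a retried task
    println(once.sameElements(twice))                  // true
  }
}
```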

On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the K means code above I think i am doing the following Incorrectly
Assigning the closest centriod index to the Row Keys of DRM
//11. Iterating over the Data Matrix(in DrmLike[Int] format) to calculate
the initial centriods
dataDrmX.mapBlock() {
case (keys, block) =>
for (row <- 0 until block.nrow) {
var dataPoint = block(row, ::)
//12. findTheClosestCentriod find the closest centriod to the
Data point specified by "dataPoint"
val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
//13. assigning closest index to key
keys(row) = closesetIndex
}
keys -> block
}
in step 12 i am finding the centriod closest to the current dataPoint
in step13 i am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint
I think i am doing step13 incorrectly.
Also i am unable to find the proper reference for the same in the reference
links which you have mentioned above
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have Created a github branch Github Branch Having Initial Kmeans Code
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
can't say i can read this code well formatted that way...
it would seem to me that the code is not using the broadcast variable
and
Post by KHATWANI PARTH BHARAT
instead is using closure variable. that's the only thing i can
immediately
Post by KHATWANI PARTH BHARAT
see by looking in the middle of it.
it would be better if you created a branch on github for that code that
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm you have Outline above
My code is as follows
This code works fine till step number 10
In step 11 i am assigning the new centriod index to corresponding row
key
Post by KHATWANI PARTH BHARAT
of data Point in the matrix
I think i am doing something wrong in step 11 may be i am using
incorrect
Post by KHATWANI PARTH BHARAT
syntax
Can you help me find out what am i doing wrong.
//start of main method
def main(args: Array[String]) {
  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))
  //3. convert the data read in as strings into arrays of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))
  //4. add a column having value 1 to each array of double; this creates (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)
  //5. convert the rdd of array of double into an rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))
  //6. convert the rdd to a DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }
  //7. convert the DrmRdd to a CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)
  //8. separate out the column of all ones created in step 4; we will use it later
  val oneVector = matrix(::, 0 until 1)
  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)
  //9. sampling to select the initial centroids
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size
  //10. broadcasting the initial centroids
  val broadCastMatrix = drmBroadcast(centriods)
  //11. iterating over the data matrix (in DrmLike[Int] format) to assign each point to its closest centroid
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)
        //12. findTheClosestCentriod finds the centroid closest to the data point "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
        //13. assigning the closest index to the key
        keys(row) = closesetIndex
      }
      keys -> block
  }
  //14. calculating (1|D)
  val b = (oneVector cbind dataDrmX)
  //15. aggregating transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15, bTranspose will have data in the following format:
  // (n+1) x K, where n = dimension of the data points and K = number of clusters;
  // the zeroth row contains the count of points assigned to each cluster
  // (assuming 3d data points)
  val nrows = b.nrow.toInt
  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)
  //17. dividing the data point sums by the count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }
  //18. separating out the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)
  //19. iterate over the above code till the convergence criteria is met
} //end of main method

// method to find the closest centroid to a data point
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

// calculating the sum of squared distances between the points (vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

// method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}
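A side note on the naive search in findTheClosestCentriod: since sqrt is monotonic, comparing squared distances yields the same argmin and saves a sqrt per candidate. Here is a plain-Python sketch of the same naive search (my own illustration, not Mahout/Samsara code):

```python
# Naive nearest-centroid search over squared distances
# (sqrt is unnecessary: it does not change which index is smallest).

def closest_centroid(point, centroids):
    best_index, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        # squared Euclidean distance; same argmin as the true distance
        d = sum((a - b) ** 2 for a, b in zip(point, c))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

print(closest_centroid([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]))  # 1
```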
Thanks & Regards
Parth Khatwani
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
ps1 this assumes row-wise construction of A based on a training set of m n-dimensional points.
ps2 since we are doing multiple passes over A, it may make sense to make sure it is committed to the spark cache (by using the checkpoint api), if spark is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
Here is the outline. For details of the APIs, please refer to the Samsara manual [2]; I will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity, let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting the initial clusters, by sampling them out. You can do that by using the sampling api [1], thus forming a k x n in-memory matrix C (the current centroids). C is therefore of Mahout's Matrix type.

You then proceed by alternating between cluster assignments and recomputing the centroid matrix C till convergence, based on some test or simply limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and recompute the centroid index for each row in A. Once we recompute an index, we put it into the row key. You can do that by assigning centroid indices to the keys of A using the operator mapBlock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapBlock() closure. Examples of that are plenty given in [2]. Essentially, in mapBlock, you'd reform the row keys to reflect the cluster index in C. While going over A, you'd have a "nearest neighbor" problem to solve for each row of A and the centroids C. This is the bulk of the computation really, and there are a few tricks there that can speed this step up in both exact and approximate manner, but you can start with a naive search.

Once you have assigned centroids to the keys of matrix A, you'd want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by the centroid key. The trick is to do a computation of (1|A)', which will result in a matrix of the shape (counts / sums of cluster rows). This is the part I find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]).

So when you compute, say, B = (1|A)', then B is (n+1) x k, so each column contains a vector corresponding to a cluster 1..k. In such a column, the first element would be the # of points in the cluster, and the rest of it would correspond to the sum of all points. So in order to arrive at an updated matrix C, we need to collect B into memory and slice the counters (first row) out from the rest of it:

C <- B(2:,:), each row divided by B(1,:)

(watch out for empty clusters with 0 elements; this will cause lack of convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B, for which I am again making reference to [2].

[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again dated but largely viable for this purpose: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
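The (1|A)' update step in the outline can be illustrated outside of Samsara. The following is a plain-Python sketch (my own illustration, not Mahout code; the name `update_centroids` is hypothetical): prepend a 1 to every row, sum the augmented rows grouped by their assigned cluster key, and in each group's total the first component is the count while the rest is the coordinate sum.

```python
# Plain-Python sketch of the (1|A)' centroid update (illustrative, not the Samsara API).

def update_centroids(points, keys, k):
    """points: list of n-dim rows; keys[i]: cluster index assigned to row i."""
    n = len(points[0])
    # sums[j] accumulates (count | coordinate sums) for cluster j, i.e. one
    # column of B = (1|A)' grouped by cluster key.
    sums = [[0.0] * (n + 1) for _ in range(k)]
    for key, p in zip(keys, points):
        row = [1.0] + list(p)  # a row of (1 | A)
        for j in range(n + 1):
            sums[key][j] += row[j]
    # C <- B(2:,:) divided by B(1,:); empty clusters (count 0) are skipped
    # here to avoid the NaN problem mentioned above.
    return [[s / col[0] for s in col[1:]] for col in sums if col[0] > 0]

points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
keys = [0, 0, 1]
print(update_centroids(points, keys, 2))  # [[1.0, 0.0], [10.0, 10.0]]
```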
Trevor Grant
2017-04-14 18:40:43 UTC
Permalink
Parth and Dmitriy,

This is awesome! As a follow-on, can we work on getting this rolled into the algorithms framework?

Happy to work with you on this, Parth!

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Dmitriy Lyubimov
I would think reassigning keys should work in most cases.

The only exception is that technically Spark contracts imply that the effect should be idempotent if a task is retried, which might be a problem in a specific scenario where the object tree is coming out of the block cache, can stay there, and be retried again. But specifically w.r.t. this key assignment I don't see any problem, since the action obviously would be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
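The idempotence claim is easy to check with a tiny sketch outside of Spark (plain Python, my own illustration; the name `assign_block` is hypothetical): because the new keys are derived only from the block contents and the centroids, applying the same transformation twice to one (keys, block) pair gives the same keys as applying it once.

```python
# Idempotence check for the key-reassignment step: the output keys depend
# only on (block, centroids), never on the incoming keys, so a task retry
# that re-runs the closure produces the identical result.

def assign_block(keys, block, centroids):
    def closest(p):
        return min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
    return [closest(row) for row in block], block

centroids = [[0.0, 0.0], [10.0, 10.0]]
block = [[1.0, 1.0], [9.0, 9.0]]

once, _ = assign_block([7, 7], block, centroids)     # garbage incoming keys
twice, _ = assign_block(once, block, centroids)      # simulated retry
print(once == twice)  # True
```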
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the k-means code above I think I am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM.

//11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      //12. findTheClosestCentriod finds the centroid closest to the data point "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint.
In step 13 I am assigning the closesetIndex to the key of the corresponding row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also, I am unable to find a proper reference for this in the links you mentioned above.
Thanks & Regards
Parth Khatwani
KHATWANI PARTH BHARAT
2017-04-15 04:37:22 UTC
Permalink
@Dmitriy, @Trevor and @Andrew,

I have tried testing this row key assignment issue which I have mentioned in the above mail, by writing a separate piece of code where I assign a default value 1 to each row key of the DRM and then take the aggregating transpose. I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.

The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

//mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      //assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the first println ("After New Cluster assignment") should be this:
{
 0 => {0: 1.0, 1: 1.0, 2: 2.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1, 2 and 3 contain the data.)

But it turns out to be this:
{
 0 => {}
 1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
 2 => {}
 3 => {}
}

And the result of the aggregating transpose should be
{
 0 => {1: 4.0}
 1 => {1: 10.0}
 2 => {1: 14.0}
 3 => {1: 18.0}
}

I have referred to the book by Andrew and Dmitriy, Apache Mahout: Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely there, but I am unable to find any example where row keys are assigned new values. The Mahout Samsara manual
(http://apache.github.io/mahout/doc/ScalaSparkBindings.html) also does not contain any such examples.
It would be great if I could get some reference to a solution of the mentioned issue.

Thanks
Parth Khatwani
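For reference, here is a plain-Python sketch (my own illustration, not Samsara; `agg_transpose` is a hypothetical name) of what an aggregating transpose computes when every row key has been reassigned to 1, assuming sum aggregation of rows that share a key: all four rows collapse under key 1, so column 1 of the result holds the element-wise sums of the rows of inCoreA.

```python
# Sketch of aggregating-transpose semantics: rows sharing a key are summed,
# and keys become column indices of the transposed result.

def agg_transpose(keyed_rows, ncols):
    # result[c][k] = sum over rows with key k of row[c]
    result = {}
    for key, row in keyed_rows:
        for c in range(ncols):
            result.setdefault(c, {}).setdefault(key, 0.0)
            result[c][key] += row[c]
    return result

in_core_a = [[1, 1, 2, 3], [1, 2, 3, 4], [1, 3, 4, 5], [1, 4, 5, 6]]
keyed = [(1, row) for row in in_core_a]  # every row key reassigned to 1
print(agg_transpose(keyed, 4))
# column 0 under key 1 is 4.0 (the count of rows); columns 1..3 hold per-column sums
```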
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Post by Dmitriy Lyubimov
i would think reassinging keys should work in most cases.
The only exception is that technically Spark contracts imply that effect
should be idempotent if task is retried, which might be a problem in a
specific scenario of the object tree coming out from block cache object
tree, which can stay there and be retried again. but specifically w.r.t.
this key assignment i don't see any problem since the action obviously
would be idempotent even if this code is run multiple times on the same
(key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the K means code above I think i am doing the following Incorrectly
Assigning the closest centriod index to the Row Keys of DRM
//11. Iterating over the Data Matrix(in DrmLike[Int] format) to
calculate
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
the initial centriods
dataDrmX.mapBlock() {
case (keys, block) =>
for (row <- 0 until block.nrow) {
var dataPoint = block(row, ::)
//12. findTheClosestCentriod find the closest centriod to the
Data point specified by "dataPoint"
val closesetIndex = findTheClosestCentriod(dataPoint,
centriods)
Post by KHATWANI PARTH BHARAT
//13. assigning closest index to key
keys(row) = closesetIndex
}
keys -> block
}
in step 12 i am finding the centriod closest to the current dataPoint
in step13 i am assigning the closesetIndex to the key of the
corresponding
Post by KHATWANI PARTH BHARAT
row represented by the dataPoint
I think i am doing step13 incorrectly.
Also i am unable to find the proper reference for the same in the
reference
Post by KHATWANI PARTH BHARAT
links which you have mentioned above
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have Created a github branch Github Branch Having Initial Kmeans
Code
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache
Mahout
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Samsara"
can't say i can read this code well formatted that way...
it would seem to me that the code is not using the broadcast
variable
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
and
Post by KHATWANI PARTH BHARAT
instead is using closure variable. that's the only thing i can
immediately
Post by KHATWANI PARTH BHARAT
see by looking in the middle of it.
it would be better if you created a branch on github for that code
that
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm you have outlined above.
My code is as follows.
This code works fine till step number 10.
In step 11 i am assigning the new centriod index to the corresponding row key of the data point in the matrix.
I think i am doing something wrong in step 11, maybe i am using incorrect syntax.
Can you help me find out what i am doing wrong.

//start of main method
def main(args: Array[String]) {
    //1. initialize the spark and mahout context
    val conf = new SparkConf()
      .setAppName("DRMExample")
      .setMaster(args(0))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
    implicit val sc = new SparkDistributedContext(new SparkContext(conf))

    //2. read the data file and save it in the rdd
    val lines = sc.textFile(args(1))
    //3. convert data read in as string in to array of double
    val test = lines.map(line => line.split('\t').map(_.toDouble))
    //4. add a column having value 1 in array of double; this will
    //   create something like (1 | D), which will be used while calculating (1 | D)'
    val augumentedArray = test.map(addCentriodColumn _)
    //5. convert rdd of array of double in rdd of DenseVector
    val rdd = augumentedArray.map(dvec(_))
    //6. convert rdd to DrmRdd
    val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }
    //7. convert DrmRdd to CheckpointedDrm[Int]
    val matrix = drmWrap(rddMatrixLike)
    //8. seperating the column having all ones created in step 4; will use it later
    val oneVector = matrix(::, 0 until 1)
    //9. final input data in DrmLike[Int] format
    val dataDrmX = matrix(::, 1 until 4)
    //9. Sampling to select initial centriods
    val centriods = drmSampleKRows(dataDrmX, 2, false)
    centriods.size
    //10. Broadcasting the initial centriods
    val broadCastMatrix = drmBroadcast(centriods)
    //11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate the initial centriods
    dataDrmX.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          var dataPoint = block(row, ::)
          //12. findTheClosestCentriod find the closest centriod to the
          //    Data point specified by "dataPoint"
          val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
          //13. assigning closest index to key
          keys(row) = closesetIndex
        }
        keys -> block
    }
    //14. Calculating the (1|D)
    val b = (oneVector cbind dataDrmX)
    //15. Aggregating Transpose (1|D)'
    val bTranspose = (oneVector cbind dataDrmX).t
    // after step 15 bTranspose will have data in the following format
    /* (n+1)*K where n = dimension of the data point, K = number of clusters
     * zeroth row will contain the count of points assigned to each cluster
     * assuming 3d data points
     */
    val nrows = b.nrow.toInt
    //16. slicing the count vectors out
    val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
    val vectorSums = b(1 until nrows, ::)
    //17. dividing the data point by count vector
    vectorSums.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          block(row, ::) /= pointCountVectors
        }
        keys -> block
    }
    //18. seperating the count vectors
    val newCentriods = vectorSums.t(::, 1 until centriods.size)
    //19. iterate over the above code till convergence criteria is meet
}//end of main method

//finding the closest centriod (takes a Vector and a Matrix in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
    var index = 0
    var closest = Double.PositiveInfinity
    for (row <- 0 until matrix.nrow) {
      val squaredSum = ssr(vec, matrix(row, ::))
      val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
      if (tempDist < closest) {
        closest = tempDist
        index = row
      }
    }
    index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
    ((a - b) ^= 2).sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
    val newArr = new Array[Double](arg.length + 1)
    newArr(0) = 1.0
    for (i <- 0 until arg.size) {
      newArr(i + 1) = arg(i)
    }
    newArr
}

Thanks & Regards
Parth Khatwani
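For a cross-check of the arithmetic behind steps 14 through 18 (sum the member points per cluster, then divide by the per-cluster counts), here is the same computation written out in plain Python; the function name and data are made up for illustration, and none of this is Mahout code:

```python
def update_centroids(points, assignments, k):
    """New centroid = (sum of member points) / (member count), per cluster."""
    dim = len(points[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for point, cluster in zip(points, assignments):
        counts[cluster] += 1
        for j, x in enumerate(point):
            sums[cluster][j] += x
    # guard against empty clusters (count 0) to avoid NaNs, as noted later in the thread
    return [[s / counts[c] for s in sums[c]] if counts[c] else None
            for c in range(k)]

points = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
print(update_centroids(points, [0, 0, 1], 2))  # -> [[2.0, 3.0], [10.0, 10.0]]
```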
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <

---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

ps1 this assumes row-wise construction of A based on a training set of m n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make sure it is committed to spark cache (by using checkpoint api), if spark is used

On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <

here is the outline. For details of APIs, please refer to the samsara manual [2], i will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters, by sampling them out. You can do that by using the sampling api [1], thus forming a k x n in-memory matrix C (current centroids). C is therefore of Mahout's Matrix type.

You then proceed by alternating between cluster assignments and recomputing the centroid matrix C till convergence, based on some test or simply limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and recompute centroid indexes for each row in A. Once we recompute an index, we put it into the row key. You can do that by assigning centroid indices to keys of A using the operator mapBlock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapBlock() closure. Examples of that are plenty given in [2]. Essentially, in mapBlock, you'd reform the row keys to reflect the cluster index in C. While going over A, you'd have a "nearest neighbor" problem to solve for the row of A and centroids C. This is the bulk of computation really, and there are a few tricks there that can speed this step up in both exact and approximate manner, but you can start with a naive search.

Once you have assigned centroids to the keys of matrix A, you'd want to do an aggregating transpose of A to compute essentially the average of rows of A grouped by the centroid key. The trick is to do a computation of (1|A)' which results in a matrix of the shape (counts/sums of cluster rows). This is the part i find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]).

So when you compute, say, B = (1 | A)', then B is (n+1) x k, so each column contains a vector corresponding to a cluster 1..k. In such a column, the first element would be the # of points in the cluster, and the rest of it would correspond to the sum of all points. So in order to arrive at an updated matrix C, we need to collect B into memory, and slice out the counters (first row) from the rest of it:

C <- B(2:,:), each row divided by B(1,:)

(watch out for empty clusters with 0 elements, this will cause lack of convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B, for which i am again making reference to [2].

[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the purpose of this: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
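The count-and-sum layout of B = (1|A)' described above can be verified on a toy example. The sketch below (plain Python, illustrative only, not Samsara API) builds B with aggregation by key, then applies the update C <- B(2:,:) divided by B(1,:):

```python
def one_aug_transpose(points, keys, k):
    """Compute B = (1|A)' with aggregation by row key.
    B is (n+1) x k: row 0 holds cluster counts, rows 1..n hold coordinate sums."""
    n = len(points[0])
    B = [[0.0] * k for _ in range(n + 1)]
    for point, key in zip(points, keys):
        B[0][key] += 1.0                 # the ones column aggregates into counts
        for j, x in enumerate(point):
            B[j + 1][key] += x           # coordinates aggregate into sums
    return B

points = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = one_aug_transpose(points, [0, 0, 1], 2)
# counts row: [2.0, 1.0]; column 0 sums: [4.0, 6.0]; column 1 sums: [5.0, 6.0]
centroids = [[B[j + 1][c] / B[0][c] for j in range(2)] for c in range(2)]
print(centroids)  # -> [[2.0, 3.0], [5.0, 6.0]]
```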
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
@Dmitriy can you please again tell me the approach to move ahead.
Thanks
Parth Khatwani

On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
yes i am unable to figure out the way ahead.
Like how to create the augmented matrix A := (0|D) which you have mentioned.

On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <
was my reply to your post on @user a bit confusing?

On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
Sir,
I am trying to write the kmeans clustering algorithm using Mahout Samsara but i am a bit confused about how to leverage Distributed Row Matrix for the same. Can anybody help me with same.
Thanks
Parth Khatwani
KHATWANI PARTH BHARAT
2017-04-19 04:32:50 UTC
Permalink
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem. Can you please help me out with it?
I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
@Dmitriy,
@Trevor and @Andrew
I have tried testing this row key assignment issue which i have mentioned in the above mail, by writing a separate code where i am assigning a default value 1 to each row key of the DRM and then taking the aggregating transpose.
I have committed the separate test code to the Github Branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

//Mapblock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      //assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating tranpose")
println("" + aggTranspose.collect)

The output of the 1st println ("After New Cluster assignment") should be this:
{
 0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centriod count and columns 1, 2 and 3 contain data)
But it turns out to be this:
{
 0 => {}
 1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
 2 => {}
 3 => {}
}
And the result of the aggregating transpose should be
{
 0 => {1: 4.0}
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}

In Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>
the aggregating transpose and other concepts are explained very nicely, but i am unable to find any example where row keys are assigned new values. The Mahout Samsara Manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
also does not contain any such examples.
It will be great if i can get some reference to a solution of the mentioned issue.
Thanks
Parth Khatwani
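For intuition about what the aggregating transpose being tested here should do: rows of a keyed matrix that share the same row key are summed into a single column of the transpose. A plain-Python model of that behavior on made-up data (illustrative only, not Mahout code):

```python
def aggregating_transpose(rows, keys):
    """Transpose a keyed row matrix, summing rows that share the same key."""
    n = len(rows[0])
    out = {}
    for row, key in zip(rows, keys):
        col = out.setdefault(key, [0.0] * n)  # one output column per distinct key
        for j, x in enumerate(row):
            col[j] += x
    return out  # maps key -> aggregated column vector

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# with every key set to 1, all rows collapse into one summed column
print(aggregating_transpose(rows, [1, 1, 1]))  # -> {1: [9.0, 12.0]}
```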
+1
Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

Parth and Dmitriy,
This is awesome. As a follow on, can we work on getting this rolled in to the algorithms framework?
Happy to work with you on this Parth!

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the effect should be idempotent if a task is retried, which might be a problem in a specific scenario of the object tree coming out from the block cache, which can stay there and be retried again. but specifically w.r.t. this key assignment i don't see any problem since the action obviously would be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir,
In the kmeans code above i think i am doing the following incorrectly:
assigning the closest centriod index to the row keys of the DRM

//11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate the initial centriods
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      //12. findTheClosestCentriod find the closest centriod to the
      //    Data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

in step 12 i am finding the centriod closest to the current dataPoint
in step 13 i am assigning the closesetIndex to the key of the corresponding row represented by the dataPoint
I think i am doing step 13 incorrectly.
Also i am unable to find a proper reference for the same in the reference links which you have mentioned above

Thanks & Regards
Parth Khatwani
expression
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
(1 cbind A).t (again, see [2]).
So when you compute, say,
B = (1 | A)',
then B is (n+1) x k, so each column contains a vector
corresponding
Post by KHATWANI PARTH BHARAT
to
Post by KHATWANI PARTH BHARAT
a
Post by Dmitriy Lyubimov
cluster 1..k. In such column, the first element would be # of
points in
Post by KHATWANI PARTH BHARAT
the
Post by Dmitriy Lyubimov
cluster, and the rest of it would correspond to sum of all
points.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
So
Post by KHATWANI PARTH BHARAT
in
Post by Dmitriy Lyubimov
order to arrive to an updated matrix C, we need to collect B
into
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
memory,
Post by Dmitriy Lyubimov
and slice out counters (first row) from the rest of it.
C <- B (2:,:) each row divided by B(1,:)
(watch out for empty clusters with 0 elements, this will
cause
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
lack
Post by KHATWANI PARTH BHARAT
of
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
convergence and NaNs in the newly computed C).
This operation obviously uses subblocking and row-wise
iteration
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
over
Post by KHATWANI PARTH BHARAT
B,
Post by Dmitriy Lyubimov
for which i am again making reference to [2].
[1] https://github.com/apache/mahout/blob/master/math-scala/
src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2], Sasmara manual, a bit dated but viable,
http://apache.github
Post by KHATWANI PARTH BHARAT
.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the
purpose of
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
this
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
http://apache.github.io/mahout/0.10.1/docs/mahout-math-
scala/index.htm
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[4] mapblock etc. http://apache.github.io/mahout
/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
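The (counts | sums) trick can be sketched outside of Mahout in a few lines of plain Python. This is a toy illustration of the math only; the function name and data are made up for illustration, and this is not the Samsara API:

```python
# Toy illustration of the (1|A)' trick: an aggregating transpose of the
# ones-augmented matrix yields per-cluster counts and per-cluster sums,
# from which the updated centroids follow by division.

def update_centroids(points, keys, k):
    """points: rows of A; keys: assigned centroid index per row."""
    n = len(points[0])
    # B = (1|A)': column j holds [count_j, sum_j(dim 0), ..., sum_j(dim n-1)]
    B = [[0.0] * k for _ in range(n + 1)]
    for key, row in zip(keys, points):
        B[0][key] += 1.0                 # the "1" column accumulates counts
        for d, x in enumerate(row):
            B[d + 1][key] += x           # the rest accumulates sums
    # C <- B(2:,:), each row divided by B(1,:); empty clusters are skipped
    # here to avoid the NaNs the outline warns about
    return [[B[d + 1][j] / B[0][j] for d in range(n)]
            for j in range(k) if B[0][j] > 0]

points = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
keys = [0, 0, 1]
print(update_centroids(points, keys, 2))  # [[2.0, 3.0], [10.0, 10.0]]
```

Dropping the empty clusters, as done above, is one way to handle the zero-count case; another is to re-seed them from random points.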
Trevor Grant
2017-04-20 22:50:37 UTC
Permalink
Hey

Sorry for delay- was getting ready to tear into this.

Would you mind posting a small sample of the data that you would expect this
application to consume?

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*


On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with it?
I am unable to find a proper reference for solving the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew,
I have tried testing the row key assignment issue which I mentioned in the
above mail, by writing a separate test in which I assign a default value of
1 to each row key of the DRM and then take the aggregating transpose.
I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock: assigning 1 to each row key
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)

val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the first println ("After New Cluster assignment") should be
this:
{
 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1,
2 and 3 contain the data.)

But it turns out to be this:
{
 0 => {}
 1 => {0:1.0,1:4.0,2:5.0,3:6.0}
 2 => {}
 3 => {}
}

And the result of the aggregating transpose should be:
{
 0 => {1: 4.0}
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}

Aggregating transpose and other concepts are explained very nicely in
"Apache Mahout: Beyond MapReduce"
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>,
but I am unable to find any example where row keys are assigned new values.
The Mahout Samsara manual
(http://apache.github.io/mahout/doc/ScalaSparkBindings.html) also does not
contain any such examples.
It would be great if I could get some reference to a solution of the
mentioned issue.
Thanks
Parth Khatwani
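The two semantics being confused here can be mimicked in plain Python. This is a hypothetical sketch of the contract, not Mahout internals: when every row gets the same int key, a plain row-keyed materialization keeps only one row per key (consistent with the output reported above), while an aggregating transpose sums the colliding rows:

```python
rows = [[1.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0],
        [1.0, 3.0, 4.0, 5.0], [1.0, 4.0, 5.0, 6.0]]
keys = [1, 1, 1, 1]  # every row reassigned key 1, as in the test above

# Row-keyed materialization: each row lands at its int key, so colliding
# keys overwrite one another and rows 0, 2, 3 stay empty
n = 4
collected = {i: [] for i in range(n)}
for k, r in zip(keys, rows):
    collected[k] = r

# Aggregating transpose: rows are SUMMED grouped by key before transposing
sums = {}
for k, r in zip(keys, rows):
    acc = sums.setdefault(k, [0.0] * len(r))
    for d, x in enumerate(r):
        acc[d] += x

print(collected[1])  # [1.0, 4.0, 5.0, 6.0]
print(sums[1])       # [4.0, 10.0, 14.0, 18.0] (column sums of all rows)
```

So once all four rows share key 1, only one row survives a plain collect, and the aggregating transpose folds all four rows into a single column of sums.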
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
Parth and Dmitriy,
This is awesome. As a follow-on, can we work on getting this rolled into
the algorithms framework?
Happy to work with you on this Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Dmitriy Lyubimov
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the
effect should be idempotent if a task is retried, which might be a problem
in the specific scenario of the object tree coming out of a block-cache
object tree, which can stay there and be retried again. But specifically
w.r.t. this key assignment I don't see any problem, since the action
obviously would be idempotent even if this code is run multiple times on
the same (key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM.

//11. Iterating over the data matrix (in DrmLike[Int] format) to calculate
the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      //12. findTheClosestCentriod finds the centroid closest to the
      //    data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint; in
step 13 I am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint.
I think I am doing step 13 incorrectly. Also, I am unable to find a proper
reference for this in the reference links which you have mentioned above.
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have created a GitHub branch having the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani

On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

Post by Dmitriy Lyubimov
can't say i can read this code well formatted that way...
it would seem to me that the code is not using the broadcast variable and
instead is using a closure variable. that's the only thing i can
immediately see by looking in the middle of it.
it would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
I have completed the k-means code as per the algorithm you have outlined
above. My code is as follows. This code works fine until step 10. In step
11 I am assigning the new centroid index to the corresponding row key of
each data point in the matrix. I think I am doing something wrong in step
11; maybe I am using incorrect syntax. Can you help me find out what I am
doing wrong?

//start of main method
def main(args: Array[String]) {
  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  //3. convert the data read in as strings into arrays of doubles
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 to each array of doubles; this creates
  //   something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert the rdd of arrays of doubles into an rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert rdd to DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert DrmRdd to CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. separate out the column of ones created in step 4; we will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. sampling to select the initial centroids
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. broadcasting the initial centroids
  val broadCastMatrix = drmBroadcast(centriods)

  //11. iterating over the data matrix (in DrmLike[Int] format) to
  //    calculate the initial centroids
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        //12. findTheClosestCentriod finds the centroid closest to the
        //    data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        //13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. calculating (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. aggregating transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15 bTranspose will have data in the following format:
  /* (n+1) x K, where n = dimension of the data point, K = number of clusters
   * the zeroth row will contain the count of points assigned to each cluster
   * (assuming 3d data points)
   */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data points by the count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. separating out the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till the convergence criteria are met
} //end of main method

//finds the closest centroid (row of the centroid matrix) to the data
//point (the Vector in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distances between the points (vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}

Thanks & Regards
Parth Khatwani
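The nearest-centroid search in findTheClosestCentriod boils down to an argmin over distances. A minimal plain-Python equivalent of that helper (toy lists instead of the Mahout Vector/Matrix types):

```python
import math

def closest_centroid(point, centroids):
    """Index of the Euclidean-nearest centroid to point."""
    best_index, best_dist = 0, math.inf
    for i, c in enumerate(centroids):
        # squared distance is enough: sqrt is monotonic, so it cannot
        # change which centroid wins the comparison
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

print(closest_centroid([1.0, 1.0], [[0.0, 0.0], [5.0, 5.0]]))  # 0
print(closest_centroid([4.0, 4.0], [[0.0, 0.0], [5.0, 5.0]]))  # 1
```

Note that the Scala helper above computes both the squared sum and its square root per centroid; comparing squared distances directly, as here, saves the sqrt call without changing the result.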
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

ps1: this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2: since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint API), if
Spark is used.
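Putting the whole outline together (sample k centroids, assign by nearest centroid, average per cluster, repeat for a budget of epochs), a compact plain-Python sketch of the algorithm being described; this is a naive single-machine illustration, not the distributed Samsara version:

```python
import random

def kmeans(points, k, epochs=20, seed=0):
    """Naive k-means: sample k initial centroids, then alternate
    nearest-centroid assignment and per-cluster averaging."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(epochs):
        counts = [0] * k
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared distance
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            counts[j] += 1
            for d, x in enumerate(p):
                sums[j][d] += x
        # update step: divide sums by counts, guarding empty clusters
        # to avoid the division-by-zero / NaN issue noted earlier
        centroids = [[s / c for s in row] if c else centroids[j]
                     for j, (row, c) in enumerate(zip(sums, counts))]
    return centroids

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(sorted(kmeans(pts, 2)))  # [[0.0, 0.5], [10.0, 10.5]]
```

A fixed epoch budget is used here for brevity; a convergence test (e.g. centroid movement below a threshold) could replace it, per the outline.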
KHATWANI PARTH BHARAT
2017-04-21 02:01:33 UTC
Permalink
@Trevor Sir,
I have attached the sample data file, and here is the link to the complete Data
File <https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.


Following is the link for the GitHub branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov

KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala>
is the complete code.


I have also made a sample program just to test assigning new values to the
keys of a row matrix and taking the aggregating transpose. I think assigning
new values to the keys of the row matrix and the aggregating transpose are
causing the main problem in the k-means code.
Following is the link to the GitHub repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>

The above code contains hard-coded data. Following are the expected and the
actual output of the above code.
The output of the 1st println (after the new cluster assignment) should be:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1, 2
and 3 contain data.)

But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be:
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}


Thanks, Trevor, for such great help.




Best Regards
Parth
Post by Trevor Grant
Hey
Sorry for the delay - I was getting ready to tear into this.
Would you mind posting a small sample of data that you would expect this
application to consume?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with it?
I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew,
I have tried testing this row-key assignment issue, which I have mentioned
in the above mail, by writing a separate piece of code where I assign a
default value of 1 to each row key of the DRM and then take the aggregating
transpose.
I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:
val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)

val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)
The output of the 1st println (after the new cluster assignment) should be:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1, 2
and 3 contain data.)
But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be:
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
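To illustrate the mechanics at play here, an aggregating transpose can be sketched outside Mahout in a few lines of plain Python (agg_transpose and the tiny 2x2 matrix below are invented for illustration, not Mahout API): rows sharing a key are summed, and the transpose's row indices are the original column indices.

```python
# Plain-Python sketch of an aggregating transpose: transpose a row-keyed
# matrix while summing entries whose source rows share the same key.
# When every row key is reassigned to 1, all rows collapse into the single
# result column keyed 1, whose entries are the original column sums.

def agg_transpose(keys, rows):
    cols = {}  # original column index -> {row key -> summed value}
    for key, row in zip(keys, rows):
        for i, x in enumerate(row):
            cols.setdefault(i, {})
            cols[i][key] = cols[i].get(key, 0.0) + x
    return cols

rows = [[1.0, 2.0],
        [3.0, 4.0]]
keys = [1, 1]                        # both rows reassigned to key 1
print(agg_transpose(keys, rows))     # {0: {1: 4.0}, 1: {1: 6.0}}
```

With distinct keys per row the result is an ordinary transpose; with shared keys it performs the group-by-key summation that the DRM aggregating transpose relies on.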
I have referred to the book written by Andrew and Dmitriy, "Apache Mahout:
Beyond MapReduce"
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely there,
but I am unable to find any example where row keys are assigned new values.
The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not
contain any such examples.
It would be great if I could get some reference to a solution for the
mentioned issue.
Thanks
Parth Khatwani
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
Parth and Dmitriy,
This is awesome - as a follow-on, can we work on getting this rolled in to
the algorithms framework?
Happy to work with you on this Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Dmitriy Lyubimov
I would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the effect
should be idempotent if a task is retried, which might be a problem in a
specific scenario of the object tree coming out from the block cache object
tree, which can stay there and be retried again. But specifically w.r.t.
this key assignment I don't see any problem, since the action obviously
would be idempotent even if this code is run multiple times on the same
(key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM.

//11. Iterating over the Data Matrix(in DrmLike[Int] format) to calculate
//    the initial centriods
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)

      //12. findTheClosestCentriod finds the closest centriod to the
      //    data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint, and
in step 13 I am assigning the closest index to the key of the corresponding
row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also, I am unable to find a proper reference for this in the reference
links which you have mentioned above.
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have created a GitHub branch having the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
Can't say I can read this code well formatted that way...
It would seem to me that the code is not using the broadcast variable and
is instead using a closure variable; that's the only thing I can
immediately see by looking in the middle of it.
It would be better if you created a branch on GitHub for that code; that
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir
I have completed the k-means code as per the algorithm you have outlined
above.
My code is as follows.
This code works fine till step number 10.
In step 11 I am assigning the new centroid index to the corresponding row
key of the data point in the matrix.
I think I am doing something wrong in step 11; maybe I am using incorrect
syntax.
Can you help me find out what I am doing wrong?

//start of main method
def main(args: Array[String]) {
  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  //3. convert data read in as string in to array of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 in array of double; this will create
  //   something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert rdd of array of double in rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert rdd to DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert DrmRdd to CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. separating the column having all ones created in step 4; will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. sampling to select the initial centriods
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. broadcasting the initial centriods
  val broadCastMatrix = drmBroadcast(centriods)

  //11. iterating over the data matrix (in DrmLike[Int] format) to calculate
  //    the initial centriods
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        //12. findTheClosestCentriod finds the closest centriod to the
        //    data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        //13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. calculating the (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. aggregating transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15, bTranspose will have data in the following format:
  /* (n+1)*K where n = dimension of the data point, K = number of clusters;
   * the zeroth row will contain the count of points assigned to each cluster
   * (assuming 3d data points)
   */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data point by count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. separating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till the convergence criteria is met
} //end of main method

//method to find the closest centriod (takes a Vector and a Matrix in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until (arg.size)) {
    newArr(i + 1) = arg(i)
  }
  newArr
}

Thanks & Regards
Parth Khatwani
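As an aside, the nearest-centroid step that the thread's findTheClosestCentriod helper implements can be sketched outside Mahout in plain Python (the function name closest_centroid and the sample centroids below are illustrative only, not Mahout API):

```python
import math

# Naive nearest-centroid search over an in-memory list of centroids:
# compare the point against every centroid and keep the index of the
# smallest Euclidean distance, mirroring findTheClosestCentriod above.

def closest_centroid(point, centroids):
    best_index, best_dist = 0, float('inf')
    for i, c in enumerate(centroids):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

centroids = [[0.0, 0.0], [10.0, 10.0]]
print(closest_centroid([1.0, 2.0], centroids))   # 0
print(closest_centroid([9.0, 8.0], centroids))   # 1
```

This linear scan is the "naive search" Dmitriy refers to; the distance computation per row is where exact or approximate speedups would go.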
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A, it may make sense to make
sure it is committed to the spark cache (by using the checkpoint api), if
spark is used.
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
Here is the outline. For details of the APIs, please refer to the Samsara
manual [2]; I will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity,
let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (the current centroids). C is therefore of
Mahout's Matrix type.

You then proceed by alternating between cluster assignments and
recomputing the centroid matrix C till convergence, based on some test or
simply limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute an
index, we put it into the row key. You can do that by assigning centroid
indices to the keys of A using the operator mapBlock() (details in [2],
[3], [4]). You also need to broadcast C in order to be able to access it
in an efficient manner inside the mapBlock() closure. Plenty of examples
of that are given in [2]. Essentially, in mapBlock, you'd reform the row
keys to reflect the cluster index in C. While going over A, you'd have a
"nearest neighbor" problem to solve for each row of A and the centroids
C. This is the bulk of the computation really, and there are a few tricks
there that can speed this step up in both exact and approximate manner,
but you can start with a naive search.

Once you have assigned centroids to the keys of matrix A, you'd want to
do an aggregating transpose of A to compute, essentially, the average of
the rows of A grouped by the centroid key. The trick is to do a
computation of (1|A)', which will result in a matrix of the shape
(counts/sums of cluster rows). This is the part I find difficult to
explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression
(1 cbind A).t (again, see [2]).

So when you compute, say,
B = (1 | A)',
then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element would be the # of
points in the cluster, and the rest of it would correspond to the sum of
all points. So in order to arrive at an updated matrix C, we need to
collect B into memory, and slice out the counters (first row) from the
rest of it:

C <- B(2:, :), with each row divided by B(1, :)

(watch out for empty clusters with 0 elements; these will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B,
for which I am again making reference to [2].

[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable,
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the purpose of this,
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapBlock etc.,
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
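To make the (1|A)' bookkeeping concrete, here is a plain-Python sketch (no Mahout; aggregate_transpose and updated_centroids are invented names) of the counts-and-sums matrix B and the centroid update described above:

```python
# Sketch of the (1|A)' trick: rows of A are keyed by their assigned
# cluster; the "aggregating transpose" sums rows that share a key.
# Column j of B then holds [count_j, coordinate sums of cluster j...].

def aggregate_transpose(keys, rows, k):
    """Simulate ((1 | A).t) where rows sharing a key are summed."""
    n = len(rows[0])
    B = [[0.0] * k for _ in range(n + 1)]    # (n+1) rows x k columns
    for key, row in zip(keys, rows):
        B[0][key] += 1.0                     # counts from the 1-column
        for i, x in enumerate(row):
            B[i + 1][key] += x               # coordinate sums
    return B

def updated_centroids(B):
    """C <- B(2:,:) with each column divided by B(1,:) (NaN if cluster is empty)."""
    counts, sums = B[0], B[1:]
    return [[sums[i][j] / counts[j] if counts[j] else float('nan')
             for i in range(len(sums))] for j in range(len(counts))]

keys = [0, 1, 0]                             # cluster assignment per row of A
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # m x n data
B = aggregate_transpose(keys, A, k=2)
print(B)                                     # [[2.0, 1.0], [6.0, 3.0], [8.0, 4.0]]
print(updated_centroids(B))                  # [[3.0, 4.0], [3.0, 4.0]]
```

The NaN guard corresponds to the warning above about empty clusters causing NaNs in the newly computed C.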
Trevor Grant
2017-04-21 16:36:27 UTC
Permalink
OK, I dug into this before I read your question carefully - that was my bad.

Assuming you want the aggregate transpose of:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}


Then why not replace the mapBlock statement as follows:

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

Here we are creating an empty column, then filling it with the row sums.

A distributed rowSums fn would be nice for just such an occasion... sigh

Let me know if that gets you going again. That was simpler than I thought;
sorry for the delay on this.

PS
Candidly, I didn't explore further once I understood the question, but if
you are going to collect this to the driver anyway (not sure if that is the
case),
A(::, 1 until 4).rowSums would also work.
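For intuition, here is a plain-Python sketch of the arithmetic the snippet above performs (illustration only; the variable names are mine, and the real version runs distributed through `mapBlock`):

```python
# The 4x4 input matrix from the thread; column 0 is the count column.
A = [
    [1.0, 1.0, 1.0, 3.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 3.0, 4.0, 5.0],
    [1.0, 4.0, 5.0, 6.0],
]

# A(::, 1 until 4) cbind 0.0 -> keep columns 1..3, append an empty column.
block = [row[1:4] + [0.0] for row in A]

# mapBlock body: fill the appended column with each row's sum.
for row in block:
    row[3] = sum(row)

# drm2(::, 3 until 4).t -> slice out just the row-sum column.
row_sums = [row[3] for row in block]
print(row_sums)  # [5.0, 9.0, 12.0, 15.0]
```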





Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*


On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the complete Data
File <https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the Github Branch For the code
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala> is
the complete code
I have also made a sample program just to test assigning new values to
the row keys of a Row Matrix and taking the aggregating transpose. I think
these two operations are causing the main problem in the k-means code.
Following is the link to Github repo for this code.
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains hard-coded data. Following are the expected and
actual outputs of the above code.
The output of the 1st println (after new cluster assignment) should be this:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centriod count, and columns 1, 2
and 3 contain the data.)
But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be:
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
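For reference, the aggregating transpose sums the rows that share a key: entry (i, k) of A' accumulates A[row][i] over every row whose key is k. A plain-Python sketch of that semantics (not the Mahout API; the data layout is mine), with all four keys set to 1:

```python
from collections import defaultdict

# Rows of the 4x4 DRM after every row key has been set to 1.
keyed_rows = [
    (1, [1.0, 1.0, 1.0, 3.0]),
    (1, [1.0, 2.0, 3.0, 4.0]),
    (1, [1.0, 3.0, 4.0, 5.0]),
    (1, [1.0, 4.0, 5.0, 6.0]),
]
ncol = 4

# key -> accumulated column of the transpose.
transposed = defaultdict(lambda: [0.0] * ncol)
for key, row in keyed_rows:
    for i, v in enumerate(row):
        transposed[key][i] += v

# Every original column collapses into the single key-1 column.
print(dict(transposed))  # {1: [4.0, 10.0, 13.0, 18.0]}
```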
Thanks, Trevor, for such great help.
Best Regards
Parth
Post by Trevor Grant
Hey,
Sorry for the delay; I was getting ready to tear into this.
Would you mind posting a small sample of the data that you would expect this
application to consume?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with it?
I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew
I have tried testing this row-key assignment issue, which I have mentioned
in the above mail, by writing a separate code where I assign a default value
of 1 to each row key of the DRM and then take the aggregating transpose.
I have committed the separate test code to the Github Branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock: assigning 1 to each row index
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println (after new cluster assignment) should be this:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centriod count, and columns 1, 2
and 3 contain the data.)
But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be:
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
I have referred to the book written by Andrew and Dmitriy, Apache Mahout:
Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
Aggregating transpose and other concepts are explained very nicely there,
but I am unable to find any example where row keys are assigned new values.
The Mahout Samsara Manual
<http://apache.github.io/mahout/doc/ScalaSparkBindings.html> also does not
contain any such examples.
It would be great if I could get some reference to a solution of the
mentioned issue.
Thanks
Parth Khatwani
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Parth and Dmitriy,
This is awesome! As a follow-on, can we work on getting this rolled in to
the algorithms framework?
Happy to work with you on this, Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Dmitriy Lyubimov
I would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the effect
should be idempotent if a task is retried, which might be a problem in a
specific scenario of the object tree coming out of the block cache object
tree, which can stay there and be retried again. But specifically w.r.t.
this key assignment I don't see any problem, since the action obviously
would be idempotent even if this code is run multiple times on the same
(key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly:
assigning the closest centriod index to the row keys of the DRM.

//11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate
//    the initial centriods
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      //12. findTheClosestCentriod finds the centriod closest to the
      //    data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centriod closest to the current dataPoint, and
in step 13 I am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also, I am unable to find a proper reference for the same in the reference
links which you have mentioned above.
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have created a Github branch having the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Can't say I can read this code well formatted that way...
It would seem to me that the code is not using the broadcast variable and
instead is using a closure variable; that's the only thing I can immediately
see by looking in the middle of it.
It would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
I have completed the Kmeans code as per the algorithm you have outlined
above.
My code is as follows.
This code works fine till step number 10.
In step 11 I am assigning the new centriod index to the corresponding row
key of the data point in the matrix.
I think I am doing something wrong in step 11; maybe I am using incorrect
syntax.
Can you help me find out what I am doing wrong?

//start of main method
def main(args: Array[String]) {
  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  //3. convert data read in as string in to array of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 in array of double; this will create
  //   something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert rdd of array of double in rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert rdd to DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert DrmRdd to CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. seperating the column having all ones created in step 4; will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. Sampling to select initial centriods
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. Broadcasting the initial centriods
  val broadCastMatrix = drmBroadcast(centriods)

  //11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate
  //    the initial centriods
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)
        //12. findTheClosestCentriod finds the centriod closest to the
        //    data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
        //13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. Calculating the (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. Aggregating Transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15 bTranspose will have data in the following format:
  /* (n+1)*K, where n = dimension of the data point, K = number of clusters
   * zeroth row will contain the count of points assigned to each cluster
   * (assuming 3d data points)
   */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data point by count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. seperating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till the convergence criteria is met
} //end of main method

//method to find the centriod closest to a data point (Vector in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  (a - b) ^= 2 sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until (arg.size)) {
    newArr(i + 1) = arg(i)
  }
  newArr
}

Thanks & Regards
Parth Khatwani
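Since findTheClosestCentriod is a plain linear scan over Euclidean distances, a pure-Python sketch of the same search may help (illustration only, not the Mahout Vector/Matrix API; the sample centroids are mine):

```python
import math

def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to `point` (Euclidean)."""
    best_index = 0
    best_dist = math.inf
    for i, c in enumerate(centroids):
        dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, c)))
        if dist < best_dist:
            best_dist = dist
            best_index = i
    return best_index

centroids = [[1.0, 1.0, 3.0], [4.0, 5.0, 6.0]]
print(closest_centroid([2.0, 3.0, 4.0], centroids))  # 0
```

Since sqrt is monotonic, comparing squared distances would pick the same index and skip the sqrt entirely; note that the Scala version above computes squaredSum but never uses it.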
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint api), if
Spark is used.
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
Here is the outline. For details of the APIs, please refer to the Samsara
manual [2]; I will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity, let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignments and recomputing
the centroid matrix C till convergence, based on some test or simply limited
by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices to
the keys of A using the operator mapblock() (details in [2], [3], [4]). You
also need to broadcast C in order to be able to access it in an efficient
manner inside the mapblock() closure. Examples of that are plenty given in
[2]. Essentially, in mapblock, you'd reform the row keys to reflect the
cluster index in C. While going over A, you'd have a "nearest neighbor"
problem to solve for the row of A and the centroids C. This is the bulk of
the computation really, and there are a few tricks there that can speed this
step up in both exact and approximate manner, but you can start with a naive
search.

Once you have assigned centroids to the keys of matrix A, you'd want to do an
aggregating transpose of A to compute, essentially, the average of the rows
of A grouped by the centroid key. The trick is to do a computation of (1|A)',
which will result in a matrix of the shape (counts/sums of cluster rows).
This is the part I find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression
(1 cbind A).t (again, see [2]).

So when you compute, say,
B = (1 | A)',
then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element would be the # of points in
the cluster, and the rest of it would correspond to the sum of all points. So
in order to arrive at an updated matrix C, we need to collect B into memory,
and slice out the counters (first row) from the rest of it:

C <- B(2:,:) each row divided by B(1,:)

(watch out for empty clusters with 0 elements; this will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B, for
which I am again making reference to [2].

[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable, http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the purpose of this:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapblock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy can you please again tell me the approach to move ahead.
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Yes, I am unable to figure out the way ahead,
like how to create the augmented matrix A := (0|D) which you have mentioned.
On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
Was my reply for your post on @user a bit confusing?
On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the kmeans clustering algorithm using Mahout Samsara,
but I am a bit confused about how to leverage the Distributed Row Matrix
for the same. Can anybody help me with the same?
Thanks
Parth Khatwani
KHATWANI PARTH BHARAT
2017-04-21 19:26:30 UTC
Permalink
@Trevor

I was trying to write "*Kmeans*" using the Mahout DRM as per the algorithm
outlined by Dmitriy.
I am facing the problem of assigning cluster IDs to the row keys.
For example, consider the matrix below, where columns 1 to 3 are the data
points and column 0 contains the count of the point:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

Now, after calculating the centroid closest to the data point at the
zeroth index, I am trying to assign the centroid's index to the *row key*.

Suppose, say, that every data point is assigned to the centroid at index
1; then after assigning key = 1 to each and every row

using the code below:

val drm2 = A.mapBlock() {
  case (keys, block) =>
    // assigning 1 to each row key
    for (row <- 0 until keys.size) {
      keys(row) = 1
    }
    (keys, block)
}



I want the above matrix to be in this form:


{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}




But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}



I am confused whether assigning the new key values to the row keys is
done through the following code line:

// assigning 1 to each row key
keys(row) = 1

or whether there is some other way.
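For what it's worth, the relabel-then-aggregate behavior being described can be sketched outside Mahout. Below is a plain-Python sketch of the semantics (not the Samsara API; all names are illustrative), using the 4 x 4 example matrix above: reassigning keys only relabels rows in place, and it is the aggregating (group-by-key) step, which the aggregating transpose performs, that actually merges rows sharing a key.

```python
# Plain-Python sketch of relabeling row keys and then aggregating by key.
# The matrix and key values mirror the 4 x 4 example above.

rows = {
    0: [1.0, 1.0, 1.0, 3.0],
    1: [1.0, 2.0, 3.0, 4.0],
    2: [1.0, 3.0, 4.0, 5.0],
    3: [1.0, 4.0, 5.0, 6.0],
}

# Step 1: relabel every row key with its cluster id (here all become 1).
# Relabeling alone does not move or merge rows; it only changes the keys.
relabeled = [(1, vec) for _, vec in sorted(rows.items())]

# Step 2: an aggregating (group-by-key) step sums rows that share a key,
# which is what the aggregating transpose does in Samsara.
aggregated = {}
for key, vec in relabeled:
    acc = aggregated.setdefault(key, [0.0] * len(vec))
    for i, v in enumerate(vec):
        acc[i] += v

print(aggregated)  # {1: [4.0, 10.0, 13.0, 18.0]}
```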



I am not able to find any useful links or references on the internet;
even Andrew and Dmitriy's book does not have a proper reference for the
above-mentioned issue.



Thanks & Regards
Parth Khatwani
Post by Trevor Grant
OK, I dug into this before I read your question carefully; that was my bad.
You want
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println(""+aggTranspose.collect)
Here we are creating an empty column, then filling it with the row sums.
A distributed rowSums fn would be nice for just such an occasion... sigh.
Let me know if that gets you going again. That was simpler than I
thought; sorry for the delay on this.
PS
Candidly, I didn't explore further once I understood the question, but if
you are going to collect this to the driver anyway (not sure if that is
the case),
A(::, 1 until 4).rowSums would also work.
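As a sanity check on the numbers, here is a plain-Python sketch (not the Mahout API; variable names are illustrative) of the two equivalent routes described above: cbind a zero column to columns 1..3 and fill it with row sums, or take the row sums directly.

```python
# Plain-Python sketch of the two equivalent row-sum routes.

A = [
    [1.0, 1.0, 1.0, 3.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 3.0, 4.0, 5.0],
    [1.0, 4.0, 5.0, 6.0],
]

# Route 1: cbind a zero column to columns 1..3, then fill it in place.
block = [row[1:4] + [0.0] for row in A]
for row in block:
    row[3] = sum(row[:3])

# Route 2: direct row sums over columns 1..3.
row_sums = [sum(row[1:4]) for row in A]

print([row[3] for row in block])  # [5.0, 9.0, 12.0, 15.0]
print(row_sums)                   # [5.0, 9.0, 12.0, 15.0]
```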
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the
complete Data File
<https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the GitHub branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala>
is the complete code.

I have also made a sample program just to test assigning new values to
the keys of a row matrix and the aggregating transpose. I think assigning
new key values to the row matrix and the aggregating transpose are
causing the main problem in the k-means code.
Following is the link to the GitHub repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains hard-coded data. Following are the expected and
actual outputs of the above code.

The output of the 1st println, "After New Cluster assignment", should be
this:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns
1, 2 and 3 contain the data.)

But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}

And the result of the aggregating transpose should be:
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

Thanks, Trevor, for such great help.

Best Regards
Parth
Post by Trevor Grant
Hey,
Sorry for the delay; I was getting ready to tear into this.
Would you mind posting a small sample of the data that you would expect
this application to consume?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with
it? I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew,
I have tried testing the row-key assignment issue which I mentioned in
the above mail, by writing a separate piece of code where I assign a
default value of 1 to each row key of the DRM and then take the
aggregating transpose.
I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    // assigning 1 to each row key
    for (row <- 0 until keys.size) {
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)

val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println, "After New Cluster assignment", should be
this:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns
1, 2 and 3 contain the data.)

But it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}

And the result of the aggregating transpose should be:
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

I have referred to the book written by Andrew and Dmitriy, Apache Mahout:
Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely
there, but I am unable to find any example where row keys are assigned
new values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not
contain any such examples.
It will be great if I can get some reference to a solution of the
mentioned issue.
Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

Parth and Dmitriy,
This is awesome. As a follow-on, can we work on getting this rolled into
the algorithms framework?
Happy to work with you on this, Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
I would think reassigning keys should work in most cases.
The only exception is that, technically, Spark contracts imply that the
effect should be idempotent if a task is retried, which might be a
problem in a specific scenario of the object tree coming out of the
block-cache object tree, which can stay there and be retried again. But
specifically w.r.t. this key assignment I don't see any problem, since
the action obviously would be idempotent even if this code is run
multiple times on the same (key, block) pair. This part should be good
IMO.
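The idempotency point can be illustrated with a toy sketch (plain Python, not Spark; the `closest` rule and all names here are hypothetical): the relabeling depends only on the block contents, not on the previous key values, so rerunning it over the same (keys, block) pair yields identical keys, which is why a task retry over a cached block is harmless.

```python
# Toy sketch: key assignment is idempotent because the new keys are a
# pure function of the block, not of the old keys.

def assign(keys, block, closest):
    # old keys are ignored; each row gets the key its contents dictate
    return [closest(row) for row in block]

block = [[1.0, 2.0], [3.0, 4.0]]
closest = lambda row: 0 if row[0] < 2 else 1  # toy nearest-centroid rule

once = assign([0, 1], block, closest)
twice = assign(once, block, closest)  # simulated task retry
assert once == twice  # same output regardless of prior key state
print(once)  # [0, 1]
```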
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM.

// 11. Iterating over the data matrix (in DrmLike[Int] format) to
//     calculate the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)

      // 12. findTheClosestCentriod finds the centroid closest to the
      //     data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      // 13. assigning the closest index to the key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint; in
step 13 I am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also, I am unable to find a proper reference for this in the reference
links which you have mentioned above.
Thanks & Regards
Parth Khatwani

On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have created a GitHub branch having the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

Can't say I can read this code well formatted that way...
It would seem to me that the code is not using the broadcast variable and
instead is using the closure variable; that's the only thing I can
immediately see by looking in the middle of it.
It would be better if you created a branch on GitHub for that code; that
would allow for easy check-outs and comments.
-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
I have completed the k-means code as per the algorithm you have outlined
above. My code is as follows.
This code works fine till step number 10.
In step 11 I am assigning the new centroid index to the corresponding row
key of the data point in the matrix.
I think I am doing something wrong in step 11; maybe I am using incorrect
syntax. Can you help me find out what I am doing wrong?

// start of main method
def main(args: Array[String]) {
  // 1. initialize the Spark and Mahout contexts
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  // 2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  // 3. convert the data read in as strings into arrays of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  // 4. add a column having value 1 to each array of double; this will
  //    create something like (1 | D), which will be used while
  //    calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  // 5. convert the rdd of arrays of double into an rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  // 6. convert the rdd to a DrmRdd
  val rddMatrixLike: DrmRdd[Int] =
    rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  // 7. convert the DrmRdd to a CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  // 8. separate the column of ones created in step 4; we will use it later
  val oneVector = matrix(::, 0 until 1)

  // 9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  // 9. sampling to select the initial centroids
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  // 10. broadcasting the initial centroids
  val broadCastMatrix = drmBroadcast(centriods)

  // 11. iterating over the data matrix (in DrmLike[Int] format) to
  //     calculate the initial centroids
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        // 12. findTheClosestCentriod finds the centroid closest to the
        //     data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        // 13. assigning the closest index to the key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  // 14. calculating (1 | D)
  val b = (oneVector cbind dataDrmX)

  // 15. aggregating transpose (1 | D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15, bTranspose will have data in the following format:
  /* (n+1) x K, where n = dimension of the data point and K = number of
   * clusters; the zeroth row will contain the count of points assigned
   * to each cluster (assuming 3-d data points) */

  val nrows = b.nrow.toInt

  // 16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  // 17. dividing the data points by the count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  // 18. separating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  // 19. iterate over the above code till the convergence criteria are met
} // end of main method

// method to find the centroid closest to a data point (Vector in the
// arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

// calculating the sum of squared distances between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2) sum
}

// method used to create (1 | D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}

Thanks & Regards
Parth Khatwani
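To make the intended end-to-end iteration concrete, here is a plain-Python sketch (no Mahout/Spark; the data, names, and helper functions are illustrative only) of one k-means pass using the (1|D) trick from steps 4 and 14-17 of the code above: prepend a 1 to each assigned point, group-sum the augmented rows by cluster key, then divide each cluster's coordinate sums by its leading count to get the new centroids.

```python
# Plain-Python sketch of one k-means pass with the (1|D) trick:
# the leading 1 in each augmented row accumulates into a point count,
# so the group sums carry both counts and coordinate sums.

import math

points = [[1.0, 1.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
centroids = [[1.0, 1.0, 3.0], [4.0, 5.0, 6.0]]  # initial sample of k=2 rows

def closest(p, cs):
    # index of the centroid with the smallest Euclidean distance to p
    return min(range(len(cs)), key=lambda i: math.dist(p, cs[i]))

# assignment step + aggregation of (1 | point) rows by cluster key
sums = {}
for p in points:
    k = closest(p, centroids)
    acc = sums.setdefault(k, [0.0] * (len(p) + 1))
    for i, v in enumerate([1.0] + p):
        acc[i] += v

# update step: divide each cluster's coordinate sums by its count
new_centroids = {k: [v / acc[0] for v in acc[1:]] for k, acc in sums.items()}
print(new_centroids)  # {0: [1.5, 2.0, 3.5], 1: [3.5, 4.5, 5.5]}
```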
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using
"Apache
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Mahout
Post by KHATWANI PARTH BHARAT
Samsara"
ps1 this assumes row-wise construction of A based on
training
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
set
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
of m
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
n-dimensional points.
ps2 since we are doing multiple passes over A it may
make
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
sense to
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
make
Post by KHATWANI PARTH BHARAT
sure it is committed to spark cache (by using checkpoint
api),
Post by KHATWANI PARTH BHARAT
if
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
spark
Post by KHATWANI PARTH BHARAT
is
used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
here is the outline. For details of APIs, please refer
to
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
samsara
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
manual
Post by Dmitriy Lyubimov
[2], i will not be be repeating it.
Assume your training data input is m x n matrix A. For
simplicity
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
let's
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
assume it's a DRM with int row keys, i.e.,
DrmLike[Int].
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
First, classic k-means starts by selecting initial
clusters,
Post by KHATWANI PARTH BHARAT
by
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
sampling
Post by Dmitriy Lyubimov
them out. You can do that by using sampling api [1],
thus
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
forming
Post by KHATWANI PARTH BHARAT
a
Post by KHATWANI PARTH BHARAT
k
Post by KHATWANI PARTH BHARAT
x n
Post by Dmitriy Lyubimov
in-memory matrix C (current centroids). C is therefore
of
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Mahout's
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Matrix
Post by Dmitriy Lyubimov
type.
You the proceed by alternating between cluster
assignments
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
and
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
recompupting centroid matrix C till convergence based
on
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
some
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
test
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
or
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
simply limited by epoch count budget, your choice.
Cluster assignments: here, we go over current
generation
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
of A
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
and
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
recompute centroid indexes for each row in A. Once we
recompute
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
index,
Post by KHATWANI PARTH BHARAT
we
Post by Dmitriy Lyubimov
put it into the row key . You can do that by assigning
centroid
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
indices
Post by KHATWANI PARTH BHARAT
to
Post by Dmitriy Lyubimov
keys of A using operator mapblock() (details in [2],
[3],
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
[4]).
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
You
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
also
Post by Dmitriy Lyubimov
need to broadcast C in order to be able to access it
in
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
efficient
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
manner
Post by Dmitriy Lyubimov
inside mapblock() closure. Examples of that are plenty
given
Post by KHATWANI PARTH BHARAT
in
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
[2].
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Essentially, in mapblock, you'd reform the row keys to
reflect
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
cluster
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
index in C. while going over A, you'd have a "nearest
neighbor"
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
problem
Post by KHATWANI PARTH BHARAT
to
Post by Dmitriy Lyubimov
solve for the row of A and centroids C. This is the
bulk of
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
computation
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
really, and there are a few tricks there that can
speed
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
this
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
step
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
up in
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
both exact and approximate manner, but you can start
with a
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
naive
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
search.
Once you have assigned centroid indices to the keys of matrix A, you'd want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by centroid key. The trick is to do a computation of (1|A)', which results in a matrix of the shape (counts | sums of cluster rows). This is the part I find difficult to explain without LaTeX graphics.
In Samsara, construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]). So when you compute, say, B = (1|A)', then B is (n+1) x k, so each column contains a vector corresponding to a cluster 1..k. In such a column, the first element is the number of points in the cluster, and the rest of it corresponds to the sum of all its points. So in order to arrive at an updated matrix C, we need to collect B into memory and slice out the counters (the first row) from the rest of it.
C <- B(2:,:), each row divided by B(1,:)

(Watch out for empty clusters with 0 elements; they will cause lack of convergence and NaNs in the newly computed C.) This operation obviously uses sub-blocking and row-wise iteration over B, for which I am again making reference to [2].
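For intuition, the count-and-sum update described above can be sketched in plain Python (a toy illustration of the arithmetic behind B = (1|A)' and the division by counts; this is not Mahout code, and `update_centroids` is a hypothetical name):

```python
# Toy sketch of the Samsara k-means update: each row of A carries a cluster key;
# B = (1|A)' accumulates, per cluster, a count (first entry) and a coordinate sum.
def update_centroids(rows, keys, k):
    dim = len(rows[0])
    # one accumulator per cluster: [count, sum_0, ..., sum_{dim-1}]
    B = [[0.0] * (dim + 1) for _ in range(k)]
    for key, row in zip(keys, rows):
        B[key][0] += 1.0                 # the "1" column aggregates to a count
        for j, x in enumerate(row):
            B[key][j + 1] += x           # coordinates aggregate to sums
    # slice the counts off and divide, guarding empty clusters (would yield NaNs)
    return [[s / B[c][0] for s in B[c][1:]] if B[c][0] > 0 else None
            for c in range(k)]

rows = [[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]]
keys = [0, 0, 1]
print(update_centroids(rows, keys, 2))  # [[2.0, 2.0], [10.0, 10.0]]
```

The `None` for an empty cluster makes the divide-by-zero hazard mentioned above explicit instead of silently producing NaNs.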
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] Scaladoc, again dated but largely viable for this purpose: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT wrote:
> @Dmitriy can you please again tell me the approach to move ahead.
>
> Thanks
> Parth Khatwani
KHATWANI PARTH BHARAT
2017-04-21 19:51:40 UTC
Permalink
@Trevor,

Following is the link to the GitHub branch with the k-means code and the sample program (which we are discussing above) that I am using to figure out what I am doing wrong in the k-means code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov

Thanks & Regards
Parth

On Sat, Apr 22, 2017 at 1:26 AM, KHATWANI PARTH BHARAT wrote:

@Trevor
I was trying to write *k-means* using a Mahout DRM as per the algorithm outlined by Dmitriy. I was facing the problem of assigning cluster IDs to the row keys.

For example, consider the matrix below, where columns 1 to 3 are the data points and column 0 contains the count of the point:
{
 0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

Now, after calculating which centroid is closest to the data point, I am trying to assign that centroid's index to the *row key*. Suppose every data point is assigned to the centroid at index 1; after assigning key = 1 to each and every row using the code below:

val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form:
{
 1 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 1 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

but it turns out to be this:
{
 0 => {}
 1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
 2 => {}
 3 => {}
}

I am confused whether assigning new key values to the row index is done through the line

// assigning 1 to each row index
keys(row) = 1

or whether there is some other way. I am not able to find any useful links or references on the internet; even Andrew and Dmitriy's book does not have a proper reference for the above-mentioned issue.

Thanks & Regards
Parth Khatwani
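One plausible reading of this output is that, on collect, each distributed row lands in the in-core slot named by its int key, so duplicate keys overwrite one another and slots no row points at stay empty. A plain-Python sketch of that assumed semantics (this is an assumption about the collect behavior, not Mahout code; `collect` here is a hypothetical stand-in):

```python
# If collect() places each distributed row at the slot given by its int key,
# duplicate keys overwrite one another and unreferenced slots stay empty.
def collect(keyed_rows, nrow):
    incore = [[] for _ in range(nrow)]   # [] stands for an all-zero (empty) row
    for key, row in keyed_rows:
        incore[key] = row                # last row with a given key wins
    return incore

keyed = [(1, [1.0, 1.0, 1.0, 3.0]),
         (1, [1.0, 2.0, 3.0, 4.0]),
         (1, [1.0, 3.0, 4.0, 5.0]),
         (1, [1.0, 4.0, 5.0, 6.0])]
print(collect(keyed, 4))
# [[], [1.0, 4.0, 5.0, 6.0], [], []]
```

This reproduces exactly the shape of the observed output above: only key 1 survives, holding the last row written.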
Post by Trevor Grant

OK, I dug into this before I read your question carefully; that was my bad.

You want
{
 0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
 0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}

val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

Here we are creating an empty column, then filling it with the row sums. A distributed rowSums fn would be nice for just such an occasion... sigh.

Let me know if that gets you going again. That was simpler than I thought; sorry for the delay on this.

PS
Candidly, I didn't explore further once I understood the question, but if you are going to collect this to the driver anyway (not sure if that is the case), A(::, 1 until 4).rowSums would also work.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
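The arithmetic of the snippet above (append a zero column, fill it with row sums, slice that column out) can be checked with a plain-Python sketch (toy code mirroring the shapes, not Mahout):

```python
# Mimic: (A(::, 1 until 4) cbind 0.0), fill column 3 with row sums,
# then slice drm2(::, 3 until 4) out as the column of sums.
A = [[1.0, 1.0, 1.0, 3.0],
     [1.0, 2.0, 3.0, 4.0],
     [1.0, 3.0, 4.0, 5.0],
     [1.0, 4.0, 5.0, 6.0]]
block = [row[1:4] + [0.0] for row in A]   # data columns cbind a zero column
for row in block:
    row[3] = sum(row)                     # fill the new column with row sums
sums_column = [row[3] for row in block]   # slice the sums column out
print(sums_column)  # [5.0, 9.0, 12.0, 15.0]
```

The result matches Trevor's corrected expected output (5.0 for the first row, not 4.0), since only the data columns 1..3 are summed.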
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT wrote:

@Trevor Sir,
I have attached the sample data file; here is the link to the complete Data File <https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.

Following is the link to the GitHub branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov

KmeansMahout.scala <https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala> is the complete code.

I have also made a sample program just to test assigning new key values to a row matrix and taking the aggregating transpose. I think assigning new key values and the aggregating transpose are causing the main problem in the k-means code. Following is the link to the GitHub repo for this code:
TestClusterAssign.scala <https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>

The above code contains hard-coded data. Following are the expected and actual outputs of the above code.

The output of the first println, after new cluster assignment, should be:
{
 0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1, 2 and 3 contain the data.)

But it turns out to be this:
{
 0 => {}
 1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
 2 => {}
 3 => {}
}

And the result of the aggregating transpose should be:
{
 0 => {1: 4.0}
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}

Thanks, Trevor, for such great help.

Best Regards
Parth
Post by Trevor Grant

Hey,
Sorry for the delay; I was getting ready to tear into this. Would you mind posting a small sample of the data you would expect this application to consume?

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT wrote:

@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck on the above problem; can you please help me out with it? I am unable to find a proper reference to solve the above issue.

Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT wrote:

@Dmitriy, @Trevor and @Andrew,
I have tried testing the row-key assignment issue mentioned in the above mail by writing separate code in which I assign a default value of 1 to each row key of the DRM and then take the aggregating transpose. I have committed the separate test code to the GitHub branch <https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.

The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

println("After New Cluster assignment")
println("" + drm2.collect)

val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the first println, after new cluster assignment, should be:
{
 0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1, 2 and 3 contain the data.)

But it turns out to be this:
{
 0 => {}
 1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
 2 => {}
 3 => {}
}

And the result of the aggregating transpose should be:
{
 0 => {1: 4.0}
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}

I have referred to the book written by Andrew and Dmitriy, Apache Mahout: Beyond MapReduce <https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>. The aggregating transpose and other concepts are explained very nicely there, but I am unable to find any example where row keys are assigned new values. The Mahout Samsara manual (http://apache.github.io/mahout/doc/ScalaSparkBindings.html) also does not contain any such examples. It would be great if I could get some reference to a solution of the mentioned issue.

Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo wrote:

+1

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Parth and Dmitriy,

This is awesome. As a follow-on, can we work on getting this rolled in to the algorithms framework? Happy to work with you on this, Parth!

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov wrote:

I would think reassigning keys should work in most cases. The only exception is that technically Spark contracts imply the effect should be idempotent if a task is retried, which might be a problem in the specific scenario of the object tree coming out of the block cache, where it can stay and be retried again. But specifically w.r.t. this key assignment I don't see any problem, since the action would obviously be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT wrote:

@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly: assigning the closest-centroid index to the row keys of the DRM.

// 11. Iterating over the data matrix (in DrmLike[Int] format) to calculate the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)

      // 12. findTheClosestCentriod finds the centroid closest to the data point "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      // 13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I find the centroid closest to the current dataPoint; in step 13 I assign the closesetIndex to the key of the corresponding row represented by the dataPoint. I think I am doing step 13 incorrectly. Also, I am unable to find a proper reference for this in the reference links you mentioned above.

Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT wrote:

Dmitriy Sir,
I have created a GitHub branch with the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>

Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo wrote:

+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
I can't say I can read this code well formatted that way... It would seem to me that the code is not using the broadcast variable and instead is using a closure variable; that's the only thing I can immediately see by looking at the middle of it.

It would be better if you created a branch on GitHub for that code; that would allow for easy check-outs and comments.

-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT wrote:

@Dmitriy Sir,
I have completed the k-means code as per the algorithm you have outlined above. My code is as follows. This code works fine up to step 10. In step 11 I am assigning the new centroid index to the corresponding row key of each data point in the matrix. I think I am doing something wrong in step 11; maybe I am using incorrect syntax. Can you help me find out what I am doing wrong?

// start of main method
def main(args: Array[String]) {
  // 1. initialize the Spark and Mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  // 2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  // 3. convert the data read in as strings into arrays of doubles
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  // 4. add a column having value 1 to the arrays of doubles; this will create
  // something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  // 5. convert the rdd of arrays of doubles into an rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  // 6. convert the rdd to a DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  // 7. convert the DrmRdd to a CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  // 8. separate the column of all ones created in step 4; we will use it later
  val oneVector = matrix(::, 0 until 1)

  // 9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  // 9. sampling to select the initial centroids
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  // 10. broadcasting the initial centroids
  val broadCastMatrix = drmBroadcast(centriods)

  // 11. iterating over the data matrix (in DrmLike[Int] format) to calculate the initial centroids
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        // 12. findTheClosestCentriod finds the centroid closest to the data point "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        // 13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  // 14. calculating the (1|D)
  val b = (oneVector cbind dataDrmX)

  // 15. aggregating transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15, bTranspose will have data in the following format:
  /* (n+1)*K where n = dimension of the data point, K = number of clusters
   * zeroth row will contain the count of points assigned to each cluster
   * assuming 3d data points
   */

  val nrows = b.nrow.toInt

  // 16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  // 17. dividing the data points by the count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  // 18. separating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  // 19. iterate over the above code till the convergence criteria are met
} // end of main method

// method to find the centroid closest to a data point
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

// calculating the sum of squared distance between the points (Vectors)
Post by KHATWANI PARTH BHARAT
def ssr(a: Vector, b: Vector): Double = {
(a - b) ^= 2 sum
}
//method used to create (1|D)
Array[Double]
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
= {
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val newArr = new Array[Double](arg.length + 1)
newArr(0) = 1.0;
for (i <- 0 until (arg.size)) {
newArr(i + 1) = arg(i);
}
newArr
}
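The (1|D) augmentation above can also be shown as a plain-Scala sketch with no Mahout types (the object and method names here are illustrative, not from the original code): each data point simply gets a constant 1.0 prepended, which later aggregates into the per-cluster count.

```scala
// Sketch: build the (1|D) row for one data point by prepending a 1.0.
// Pure Scala, no Mahout types; names are illustrative only.
object OneAugment {
  def prependOne(point: Array[Double]): Array[Double] = {
    val out = new Array[Double](point.length + 1)
    out(0) = 1.0 // the "1" column that sums into the cluster point count
    Array.copy(point, 0, out, 1, point.length)
    out
  }

  def main(args: Array[String]): Unit = {
    println(prependOne(Array(2.0, 3.0, 4.0)).mkString(",")) // 1.0,2.0,3.0,4.0
  }
}
```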
Thanks & Regards
Parth Khatwani
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
ps1 this assumes row-wise construction of A based on a training set of m n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make sure it is committed to the Spark cache (by using the checkpoint api), if Spark is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
here is the outline. For details of APIs, please refer to the Samsara manual [2], i will not be repeating it.
Assume your training data input is an m x n matrix A. For simplicity let's assume it's a DRM with int row keys, i.e., DrmLike[Int].
First, classic k-means starts by selecting initial clusters, by sampling them out. You can do that by using the sampling api [1], thus forming a k x n in-memory matrix C (current centroids). C is therefore of Mahout's Matrix type.
You then proceed by alternating between cluster assignments and recomputing the centroid matrix C till convergence, based on some test or simply limited by an epoch count budget, your choice.
Cluster assignments: here, we go over the current generation of A and recompute centroid indexes for each row in A. Once we recompute an index, we put it into the row key. You can do that by assigning centroid indices to the keys of A using the operator mapBlock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapBlock() closure. Examples of that are plenty in [2].
Essentially, in mapBlock, you'd reform the row keys to reflect the cluster index in C. While going over A, you'd have a "nearest neighbor" problem to solve between each row of A and the centroids C. This is the bulk of the computation really, and there are a few tricks there that can speed this step up in both exact and approximate manner, but you can start with a naive search.
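The naive search mentioned above can be sketched in plain Scala over in-memory arrays (no Mahout types; the object and method names are illustrative). In the actual job this logic would run inside the mapBlock() closure against the broadcast copy of C:

```scala
// Sketch: naive nearest-centroid search over an in-memory centroid matrix.
// Pure Scala, illustrative only; names are not from the original code.
object NearestCentroid {
  // squared Euclidean distance between two points of equal dimension
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // returns the row index of the centroid closest to `point`
  def closest(point: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => sqDist(point, centroids(i)))
}
```

Note that since sqrt is monotonic, comparing squared distances is enough; taking the square root per candidate (as the thread's code does) is unnecessary work.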
Once you have assigned centroids to the keys of matrix A, you'd want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by the centroid key. The trick is to do a computation of (1|A)', which results in a matrix of the shape (counts|sums of cluster rows). This is the part i find difficult to explain without latex graphics.
In Samsara, construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]).
So when you compute, say,
B = (1 | A)',
then B is (n+1) x k, so each column contains a vector corresponding to a cluster 1..k. In such a column, the first element would be the # of points in the cluster, and the rest of it would correspond to the sum of all points. So in order to arrive at an updated matrix C, we need to collect B into memory, and slice out the counters (first row) from the rest of it.
C <- B(2:,:) each row divided by B(1,:)
(watch out for empty clusters with 0 elements, this will cause lack of convergence and NaNs in the newly computed C).
This operation obviously uses subblocking and row-wise iteration over B, for which i am again making reference to [2].
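The centroid update described above (counts in the first row of B, coordinate sums below, divide, and guard empty clusters) can be sketched in-memory in plain Scala. This is illustrative only, using bare arrays instead of Mahout's Matrix type, and keeps an empty cluster's old centroid to avoid the NaN problem noted above:

```scala
// Sketch: recompute centroids from B = (1|A)' collected in memory.
// B is (n+1) x k: row 0 holds per-cluster point counts, rows 1..n hold
// per-cluster coordinate sums. Pure Scala, names illustrative only.
object CentroidUpdate {
  def update(b: Array[Array[Double]],                 // (n+1) x k
             old: Array[Array[Double]]): Array[Array[Double]] = { // k x n
    val k = b(0).length
    val n = b.length - 1
    Array.tabulate(k) { j =>
      val count = b(0)(j)
      if (count == 0.0) old(j) // empty cluster: keep old centroid, avoid NaN
      else Array.tabulate(n)(i => b(i + 1)(j) / count)
    }
  }
}
```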
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the purpose of this: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy can you please again tell me the approach to move ahead.
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
yes i am unable to figure out the way ahead.
Like how to create the augmented matrix A := (0|D) which you have mentioned.
On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
was my reply for your post on @user a bit confusing?
On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the kmeans clustering algorithm using Mahout Samsara but i am a bit confused about how to leverage the Distributed Row Matrix for the same. Can anybody help me with the same.
Thanks
Parth Khatwani
Trevor Grant
2017-04-21 19:54:03 UTC
Permalink
Got it- in short no.

Think of the keys like a dictionary or HashMap.

That's why everything is ending up on row 1.

What are you trying to achieve by creating keys of 1?
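The HashMap analogy can be illustrated with a plain-Scala sketch (illustrative only, not Mahout's actual collect logic): a key-indexed collection has one slot per key, so rows that share a key collapse onto a single entry, which matches the observed output where everything lands on row 1 and the other rows come back empty.

```scala
// Sketch: why identical row keys collapse. A key -> row mapping keeps one
// slot per key, like a HashMap; later rows with the same key overwrite
// earlier ones. Pure Scala, illustrative only.
object KeyCollapse {
  def collectByKey(rows: Seq[(Int, Array[Double])]): Map[Int, Array[Double]] =
    rows.toMap // duplicate keys: last row wins, the others vanish
}
```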
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor
I was trying to write "*Kmeans*" using a Mahout DRM as per the algorithm outlined by Dmitriy.
I was facing the problem of assigning cluster ids to the row keys.
For example, consider the below matrix, where columns 1 to 3 are the data points and column 0 contains the count of the point
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
now after calculating the centroid closest to the data point at the zeroth index, i am trying to assign the centroid index to the *row key*.
Now suppose that every data point is assigned to the centroid at index 1, so after assigning key=1 to each and every row using the code below

val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
I am confused whether assigning the new key values to the row index is done through the following code line
keys(row) = 1
or whether there is some other way.
I am not able to find any useful links or references on the internet; even Andrew and Dmitriy's book does not have any proper reference for the above mentioned issue.
Thanks & Regards
Parth Khatwani
Post by Trevor Grant
OK, i dug into this before i read your question carefully, that was my bad.
Post by Trevor Grant
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

Where we are creating an empty column, then filling it with the row sums.
A distributed rowSums fn would be nice for just such an occasion... sigh
Let me know if that gets you going again. That was simpler than I thought- sorry for the delay on this.
PS
Candidly, I didn't explore further once i understood the question, but if you are going to collect this to the driver anyway (not sure if that is the case)
A(::, 1 until 4).rowSums would also work.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file and here is the link to the complete Data File <https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the Github branch for the code
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala> is the complete code.
I also have made a sample program just to test assigning new values to the keys of a row matrix and the aggregating transpose. I think assigning new values to the keys of the row matrix and the aggregating transpose is causing the main problem in the Kmeans code.
Following is the link to the Github repo for this code.
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains the hard coded data. Following is the expected and the actual output of the above code.
The output of the 1st println after new cluster assignment should be this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count and columns 1, 2 and 3 contain the data)
But it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
Thanks Trevor for such great help
Best Regards
Parth
On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant <
Post by Trevor Grant
Hey
Sorry for delay- was getting ready to tear into this.
Would you mind posting a small sample of data that you would expect this application to consume.
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with it.
I am unable to find the proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew
I have tried testing this row key assignment issue which i have mentioned in the above mail, by writing a separate code where i am assigning a default value 1 to each row key of the DRM and then taking the aggregating transpose.
I have committed the separate test code to the Github branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)
//Mapblock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println after new cluster assignment should be this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count and columns 1, 2 and 3 contain the data)
But it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
I have referred to the book written by Andrew and Dmitriy, Apache Mahout: Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>. Aggregating transpose and other concepts are explained very nicely over there, but i am unable to find any example where row keys are assigned new values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not contain any such examples.
It would be great if i could get some reference to a solution of the mentioned issue.
Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Parth and Dmitriy,
This is awesome- as a follow on can we work on getting this rolled in to the algorithms framework?
Happy to work with you on this Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the effect should be idempotent if a task is retried, which might be a problem in a specific scenario of the object tree coming out from the block cache, which can stay there and be retried again. but specifically w.r.t. this key assignment i don't see any problem since the action obviously would be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the K means code above I think i am doing the following incorrectly:
Assigning the closest centroid index to the row keys of the DRM

//11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate the initial centriods
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      //12. findTheClosestCentriod finds the centriod closest to the Data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

in step 12 i am finding the centriod closest to the current dataPoint
in step 13 i am assigning the closesetIndex to the key of the corresponding row represented by the dataPoint
I think i am doing step 13 incorrectly.
Also i am unable to find the proper reference for the same in the reference links which you have mentioned above
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have created a github branch: Github Branch Having Initial Kmeans Code
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

can't say i can read this code well formatted that way...

it would seem to me that the code is not using the broadcast variable and
instead is using the closure variable. that's the only thing i can
immediately see by looking in the middle of it.

it would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.

-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm you have outlined
above.
My code is as follows.
This code works fine till step number 10.
In step 11 i am assigning the new centriod index to the corresponding row
key of the data point in the matrix.
I think i am doing something wrong in step 11, maybe i am using incorrect
syntax.
Can you help me find out what am i doing wrong.

//start of main method
def main(args: Array[String]) {
  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  //3. convert data read in as string in to array of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 in array of double; this will create
  // something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert rdd of array of double in rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert rdd to DrmRdd
  val rddMatrixLike: DrmRdd[Int] =
    rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert DrmRdd to CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. seperating the column having all ones created in step 4; will use
  // it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. Sampling to select initial centriods
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. Broadcasting the initial centriods
  val broadCastMatrix = drmBroadcast(centriods)

  //11. Iterating over the Data Matrix (in DrmLike[Int] format) to
  // calculate the initial centriods
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)
        //12. findTheClosestCentriod finds the closest centriod to the
        // Data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
        //13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. Calculating the (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. Aggregating Transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15 bTranspose will have data in the following format
  /* (n+1)*K where n=dimension of the data point, K=number of clusters
   * zeroth row will contain the count of points assigned to each cluster
   * assuming 3d data points
   */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data point by count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. seperating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till convergence criteria is meet
} //end of main method

// method to find the closest centriod to the data point
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  (a - b) ^= 2 sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0;
  for (i <- 0 until (arg.size)) {
    newArr(i + 1) = arg(i);
  }
  newArr
}

Thanks & Regards
Parth Khatwani
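A likely culprit for step 11 in the code above: in Samsara, `mapBlock` returns a new DRM rather than mutating its operand, so an expression whose result is never assigned (as in steps 11 and 17 above) has no lasting effect. The same pitfall can be shown with plain Scala collections; `xs` and `doubled` are hypothetical names used only for this illustration:

```scala
// map returns a new collection and leaves the original untouched unless
// the result is captured; DRM operations such as mapBlock behave the same.
val xs = Vector(1, 2, 3)
xs.map(_ * 2)               // result discarded: xs is still Vector(1, 2, 3)
val doubled = xs.map(_ * 2) // captured in a new value: Vector(2, 4, 6)
```

In the DRM code this would mean capturing the result, e.g. `val assignedDrm = dataDrmX.mapBlock() { ... }`, and using `assignedDrm` in the subsequent steps.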
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make
sure it is committed to spark cache (by using the checkpoint api), if spark
is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
here is the outline. For details of APIs, please refer to the samsara
manual [2], i will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignments and recomputing
the centroid matrix C till convergence, based on some test or simply
limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute centroid indexes for each row in A. Once we recompute the index,
we put it into the row key. You can do that by assigning centroid indices
to keys of A using operator mapblock() (details in [2], [3], [4]). You also
need to broadcast C in order to be able to access it in an efficient manner
inside the mapblock() closure. Examples of that are plenty given in [2].

Essentially, in mapblock, you'd reform the row keys to reflect the cluster
index in C. While going over A, you'd have a "nearest neighbor" problem to
solve for each row of A and centroids C. This is the bulk of the
computation really, and there are a few tricks there that can speed this
step up in both exact and approximate manner, but you can start with a
naive search.

Once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute essentially the average of rows of
A grouped by the centroid key. The trick is to do a computation of (1|A)'
which will result in a matrix of the shape (counts/sums of cluster rows).
This is the part i find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression
(1 cbind A).t (again, see [2]).

So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element would be the # of points
in the cluster, and the rest of it would correspond to the sum of all
points. So in order to arrive at an updated matrix C, we need to collect B
into memory, and slice out the counters (first row) from the rest of it:

C <- B(2:,:), each row divided by B(1,:)

(watch out for empty clusters with 0 elements, this will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B,
for which i am again making reference to [2].
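The centroid update described above can be sketched with plain Scala collections, independent of the DRM API (a real implementation would use `(1 cbind A).t` as noted; `updateCentroids` and the other names here are hypothetical). Each point is prepended with a 1, rows are summed per assigned cluster so that element 0 becomes the point count, and the count then divides the coordinate sums, with a guard for the empty-cluster case that would otherwise produce NaNs:

```scala
// Sketch of the centroid update via (1|A)': per cluster, sum augmented
// rows (1, x1..xn); element 0 accumulates the point count, the rest the
// coordinate sums, so dividing by the count yields the new centroid.
def updateCentroids(points: Seq[Array[Double]],
                    assignments: Seq[Int],
                    k: Int): Seq[Option[Array[Double]]] = {
  val n = points.head.length
  val sums = Array.fill(k, n + 1)(0.0)
  for ((p, c) <- points.zip(assignments)) {
    sums(c)(0) += 1.0                              // counter column of (1|A)
    for (j <- 0 until n) sums(c)(j + 1) += p(j)    // coordinate sums
  }
  // guard: an empty cluster has count 0 and no defined centroid
  sums.toSeq.map { row =>
    if (row(0) == 0.0) None
    else Some(row.drop(1).map(_ / row(0)))
  }
}
```

Usage: with three points, the first two assigned to cluster 0 and the third to cluster 1, cluster 0's new centroid is the mean of its two points and cluster 2 (never assigned) comes back as `None`.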
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable,
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for this purpose,
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapblock etc.
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
@Dmitriy can you please again tell me the approach to move ahead.

Thanks
Parth Khatwani

On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
yes i am unable to figure out the way ahead.
Like how to create the augmented matrix A := (0|D) which you have
mentioned.

On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <
was my reply for your post on @user a bit confusing?

On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
Sir,
I am trying to write the kmeans clustering algorithm using Mahout Samsara
but i am a bit confused about how to leverage Distributed Row Matrix for
the same. Can anybody help me with same.

Thanks
Parth Khatwani
KHATWANI PARTH BHARAT
2017-04-21 20:06:28 UTC
Permalink
One is the cluster ID (the centroid index) to which the data point should
be assigned.
As per what is given in chapter 4 of this book,
Apache-Mahout-Mapreduce-Dmitriy-Lyubimov
<http://www.amazon.in/Apache-Mahout-Mapreduce-Dmitriy-Lyubimov/dp/1523775785>,
about the aggregating transpose:
from what i have understood, rows having the same key will be added
together when we take the aggregating transpose of the matrix.
So i think there should be a way to assign new values to row keys, and i
think Dmitriy has also mentioned the same thing in the approach he has
outlined in this mail chain.
Correct me if i am wrong.

Thanks
Parth Khatwani
Post by Trevor Grant
Got it; in short, no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1.
What are you trying to achieve by creating keys of 1?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
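Trevor's HashMap analogy can be sketched in the Samsara Scala DSL (a sketch only, assuming the usual Samsara imports and an implicit distributed context, as in the code discussed in this thread): when every row is given the same Int key, `collect` places all rows at that one index, while the aggregating transpose sums the colliding rows instead.

```scala
// Sketch; assumes an implicit DistributedContext (e.g. from sparkbindings).
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val A = drmParallelize(dense((1, 2), (3, 4), (5, 6)))

// Remap every row key to 1: keys now collide, like duplicate HashMap keys.
val remapped = A.mapBlock() { case (keys, block) =>
  (keys.map(_ => 1), block)
}

// remapped.collect: rows with the same Int key land on the same row index,
// so the in-core result is mostly empty, with only row 1 populated.
// remapped.t (aggregating transpose): colliding rows are summed instead,
// yielding a single non-empty column 1 holding (1+3+5, 2+4+6).
```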
On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor
I was trying to write "*Kmeans*" using the Mahout DRM as per the algorithm
outlined by Dmitriy.
I was facing the problem of assigning cluster IDs to the row keys.
For example, consider the matrix below, where columns 1 to 3 are the data
points and column 0 contains the count of the point:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now, after calculating the centroid closest to the data point at the zeroth
index, I am trying to assign the centroid index to the *row key*.
Suppose every data point is assigned to the centroid at index 1; after
assigning key=1 to each and every row using the code below

val drm2 = A.mapBlock() {
  case (keys, block) =>
    // assigning 1 to each row index
    for (row <- 0 until keys.size) {
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
I am confused whether assigning the new key values to the row index is done
through the following code line

keys(row) = 1

or whether there is another way.
I am not able to find any useful links or references on the internet; even
Andrew and Dmitriy's book does not have a proper reference for the
above-mentioned issue.
Thanks & Regards
Parth Khatwani
Post by Trevor Grant
OK, I dug into this before I read your question carefully, that was my bad.
You want
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating tranpose")
println("" + aggTranspose.collect)

Where we are creating an empty column, then filling it with the row sums.
A distributed rowSums fn would be nice for just such an occasion... sigh
Let me know if that gets you going again. That was simpler than I thought-
sorry for delay on this.
PS
Candidly, I didn't explore further once I understood the question, but if
you are going to collect this to the driver anyway (not sure if that is the
case)
A(::, 1 until 4).rowSums would also work.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
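The "distributed rowSums fn" Trevor wishes for could be sketched roughly as below. This is a hypothetical helper, not an existing Mahout API at the time of this thread; it assumes the standard Samsara imports and a distributed context.

```scala
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Hypothetical: per-row sums as an m x 1 DRM, computed block-wise so the
// matrix never has to be collected to the driver.
def drmRowSums(drm: DrmLike[Int]): DrmLike[Int] =
  drm.mapBlock(ncol = 1) { case (keys, block) =>
    val sums = block.like(block.nrow, 1)
    for (row <- 0 until block.nrow) sums(row, 0) = block(row, ::).sum
    (keys, sums)
  }
```

Keys pass through unchanged, so row identity is preserved, unlike the cbind-then-slice workaround above.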
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the complete
data file <https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the GitHub branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala>
is the complete code.
I have also made a sample program just to test assigning new values to the
keys of a row matrix and the aggregating transpose. I think assigning new
values to the keys of a row matrix and the aggregating transpose is causing
the main problem in the k-means code.
Following is the link to the GitHub repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains hard-coded data. Following is the expected and
actual output of the above code.
The output of the 1st println, after the new cluster assignment, should be
this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1,
2 and 3 contain data)
But it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
Thanks Trevor for such great help.
Best Regards
Parth
On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant <
Post by Trevor Grant
Hey
Sorry for delay- was getting ready to tear into this.
Would you mind posting a small sample of data that you would expect this
application to consume?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with it?
I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew
I have tried testing this row key assignment issue, which I have mentioned
in the above mail, by writing a separate code where I assign a default
value of 1 to each row key of the DRM and then take the aggregating
transpose.
I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    // assigning 1 to each row index
    for (row <- 0 until keys.size) {
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)

val aggTranspose = drm2.t
println("Result of aggregating tranpose")
println("" + aggTranspose.collect)

The output of the 1st println, after the new cluster assignment, should be
this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count, and columns 1,
2 and 3 contain data)
But it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
I have referred to the book written by Andrew and Dmitriy, Apache Mahout:
Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely
there, but I am unable to find any example where row keys are assigned new
values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not
contain any such examples.
It would be great if I could get some reference to a solution of the
mentioned issue.
Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
Parth and Dmitriy,
This is awesome- as a follow on, can we work on getting this rolled in to
the algorithms framework?
Happy to work with you on this Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the
effect should be idempotent if a task is retried, which might be a problem
in a specific scenario of the object tree coming out from the block cache
object tree, which can stay there and be retried again. But specifically
w.r.t. this key assignment i don't see any problem, since the action
obviously would be idempotent even if this code is run multiple times on
the same (key, block) pair. This part should be good IMO.
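Dmitriy's idempotency point can be illustrated in plain Scala (a toy sketch, not Mahout code): the key reassignment is safe under Spark task retries because applying the same assignment twice to the same keys gives the same result.

```scala
// Toy stand-in for a DRM key array: the assignment f is idempotent,
// f(f(keys)) == f(keys), so a retried task produces identical output.
object IdempotenceSketch extends App {
  def assignCluster(keys: Array[Int]): Array[Int] = keys.map(_ => 1)

  val keys  = Array(0, 1, 2, 3)
  val once  = assignCluster(keys)
  val twice = assignCluster(assignCluster(keys))
  assert(once.sameElements(twice))
}
```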
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the k-means code above, I think I am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM.

// 11. Iterating over the data matrix (in DrmLike[Int] format) to
// calculate the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      // 12. findTheClosestCentriod finds the centroid closest to the
      // data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      // 13. assigning the closest index to the key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint.
In step 13 I am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also, I am unable to find a proper reference for this in the reference
links which you have mentioned above.
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have created a GitHub branch having the initial k-means code
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
can't say i can read this code well formatted that way...
it would seem to me that the code is not using the broadcast variable and
instead is using a closure variable. that's the only thing i can
immediately see by looking in the middle of it.
it would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.
-d
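Dmitriy's broadcast-vs-closure remark refers to the pattern sketched below (a sketch only, assuming the Samsara Spark bindings; `findTheClosestCentriod` is the helper from the code in this thread): the broadcast handle, not the in-core matrix itself, should be captured by the mapBlock closure and dereferenced inside it.

```scala
import org.apache.mahout.math.Matrix
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Closure capture (what the posted code does): `centriods` itself is
// serialized into every task:
//   findTheClosestCentriod(dataPoint, centriods)
//
// Broadcast (what Dmitriy suggests): ship the matrix once per executor,
// then dereference the handle inside the block function.
def assignClusters(dataDrmX: DrmLike[Int], centriods: Matrix)
                  (implicit ctx: DistributedContext): DrmLike[Int] = {
  val bcastCentriods = drmBroadcast(centriods)
  dataDrmX.mapBlock() { case (keys, block) =>
    val c: Matrix = bcastCentriods.value // dereference inside the closure
    for (row <- 0 until block.nrow) {
      keys(row) = findTheClosestCentriod(block(row, ::), c)
    }
    (keys, block)
  }
}
```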
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm
you
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
have
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Outline
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
above
My code is as follows
This code works fine till step number 10
In step 11 i am assigning the new centriod index to
corresponding
Post by Dmitriy Lyubimov
row
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
key
Post by KHATWANI PARTH BHARAT
of data Point in the matrix
I think i am doing something wrong in step 11 may be i
am
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
using
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
incorrect
Post by KHATWANI PARTH BHARAT
syntax
Can you help me find out what am i doing wrong.
//start of main method
def main(args: Array[String]) {
//1. initialize the spark and mahout context
val conf = new SparkConf()
.setAppName("DRMExample")
.setMaster(args(0))
.set("spark.serializer",
"org.apache.spark.serializer.
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
KryoSerializer")
.set("spark.kryo.registrator",
"org.apache.mahout.sparkbindings.io.
MahoutKryoRegistrator")
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
implicit val sc = new SparkDistributedContext(new
SparkContext(conf))
Post by KHATWANI PARTH BHARAT
//2. read the data file and save it in the rdd
val lines = sc.textFile(args(1))
//3. convert data read in as string in to array of
double
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val test = lines.map(line =>
line.split('\t').map(_.toDoubl
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
e))
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//4. add a column having value 1 in array of
double
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
this
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
will
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
create something like (1 | D)', which will be used
while
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
calculating
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
(1 | D)'
val augumentedArray = test.map(addCentriodColumn
_)
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//5. convert rdd of array of double in rdd of
DenseVector
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val rdd = augumentedArray.map(dvec(_))
//6. convert rdd to DrmRdd
val rddMatrixLike: DrmRdd[Int] =
rdd.zipWithIndex.map
Post by Trevor Grant
{
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
case
Post by KHATWANI PARTH BHARAT
(v,
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
idx) => (idx.toInt, v) } //7. convert DrmRdd to
CheckpointedDrm[Int] val matrix =
drmWrap(rddMatrixLike)
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//8.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
seperating the column having all ones created in step
4
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
and
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
will
Post by KHATWANI PARTH BHARAT
use
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
it later val oneVector = matrix(::, 0 until 1)
//9.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
final
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
input data in DrmLike[Int] format val dataDrmX =
matrix(::,
Post by KHATWANI PARTH BHARAT
1
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
until
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
4) //9. Sampling to select initial
centriods
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
val
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
centriods = drmSampleKRows(dataDrmX, 2, false)
centriods.size
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//10. Broad Casting the initial centriods val
broadCastMatrix
Post by KHATWANI PARTH BHARAT
=
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
drmBroadcast(centriods) //11. Iterating
over
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Data
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Matrix(in DrmLike[Int] format) to calculate the
initial
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
centriods
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
dataDrmX.mapBlock() { case (keys, block) =>
for
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
(row
Post by KHATWANI PARTH BHARAT
<-
Post by Dmitriy Lyubimov
0
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
until block.nrow) { var dataPoint =
block(row,
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
::)
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//12. findTheClosestCentriod find the closest
centriod
Post by KHATWANI PARTH BHARAT
to
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Data point specified by "dataPoint" val
closesetIndex
Post by KHATWANI PARTH BHARAT
=
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
findTheClosestCentriod(dataPoint, centriods)
//13.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
assigning closest index to key keys(row) =
closesetIndex
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
} keys -> block }
//14. Calculating the (1|D) val b =
(oneVector
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
cbind
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
dataDrmX) //15. Aggregating Transpose (1|D)'
val
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
bTranspose
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
= (oneVector cbind dataDrmX).t // after step 15
bTranspose
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
will
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
have data in the following format /*(n+1)*K
where
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
n=dimension
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
of the data point, K=number of clusters * zeroth
row
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
will
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
contain
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
the count of points assigned to each cluster *
assuming
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
3d
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
data
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
points * */
val nrows = b.nrow.toInt //16. slicing the
count
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
vectors
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
out
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val pointCountVectors = drmBroadcast(b(0 until 1,
::).collect(0,
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
::))
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val vectorSums = b(1 until nrows, ::) //17.
dividing
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
data
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
point by count vector vectorSums.mapBlock() {
case
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
(keys,
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
block) => for (row <- 0 until block.nrow) {
block(row,
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
::) /= pointCountVectors } keys -> block
}
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//18.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
seperating the count vectors val newCentriods =
vectorSums.t(::,1
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
until centriods.size) //19. iterate over
the
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
above
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
code
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
// ... till convergence criteria is met } // end of main method

// method to find the closest centroid to a data point
def closestCentroid(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

// calculating the sum of squared distances between the points (vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

// method used to create (1|D)
def prependOne(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}
Thanks & Regards
Parth Khatwani
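For readers without a Mahout build handy, the nearest-centroid search above can be sketched in plain Python. All names here (`ssr`, `closest_centroid`) mirror the Scala snippet but are otherwise illustrative; the real code operates on Mahout `Vector`/`Matrix` types.

```python
import math

def ssr(a, b):
    """Sum of squared residuals between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` (naive linear scan)."""
    index, closest = 0, float("inf")
    for row, c in enumerate(centroids):
        dist = math.sqrt(ssr(point, c))
        if dist < closest:
            closest, index = dist, row
    return index

centroids = [[0.0, 0.0], [10.0, 10.0]]
print(closest_centroid([9.0, 8.0], centroids))  # -> 1
```

As Dmitriy notes later in the thread, this naive scan is the bulk of the per-iteration cost; exact and approximate speedups exist but the linear scan is the correct starting point.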
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make
sure it is committed to spark cache (by using the checkpoint api), if
spark is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
here is the outline. For details of APIs, please refer to the samsara
manual [2], i will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity
let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignments and
recomputing the centroid matrix C till convergence, based on some test or
simply limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute an
index, we put it into the row key. You can do that by assigning centroid
indices to keys of A using the operator mapBlock() (details in [2], [3],
[4]). You also need to broadcast C in order to be able to access it in an
efficient manner inside the mapBlock() closure. Examples of that are
plenty in [2]. Essentially, in mapBlock, you'd reform the row keys to
reflect the cluster index in C. While going over A, you'd have a "nearest
neighbor" problem to solve for each row of A against the centroids C.
This is the bulk of the computation really, and there are a few tricks
there that can speed this step up in both exact and approximate manner,
but you can start with a naive search.

Once you have assigned centroids to the keys of matrix A, you'd want to
do an aggregating transpose of A to compute, essentially, the average of
the rows of A grouped by the centroid key. The trick is to do a
computation of (1|A)', which results in a matrix of the shape
(counts | sums of cluster rows). This is the part i find difficult to
explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression
(1 cbind A).t (again, see [2]).

So when you compute, say, B = (1|A)', then B is (n+1) x k, so each column
contains a vector corresponding to a cluster 1..k. In such a column, the
first element would be the # of points in the cluster, and the rest of it
would correspond to the sum of all points. So in order to arrive at an
updated matrix C, we need to collect B into memory, and slice out the
counters (first row) from the rest of it:

C <- B(2:,:) each row divided by B(1,:)

(watch out for empty clusters with 0 elements, this will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B,
for which i am again making reference to [2].
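The (1|A)' update step can be sketched in plain Python, independent of Mahout. All names below are illustrative; in Samsara the aggregation happens distributed via `(1 cbind A).t`, with only the small B matrix collected to the driver.

```python
# Sketch of the (1|A)' trick: prepend 1 to each point, sum rows grouped
# by assigned centroid key; each group's first entry becomes the cluster
# count and the rest the coordinate sums. Dividing sums by counts gives
# the updated centroids.

def update_centroids(points, assignments, k):
    n = len(points[0])
    # B[c] plays the role of one column of B = (1|A)': [count, sums...]
    B = [[0.0] * (n + 1) for _ in range(k)]
    for p, c in zip(points, assignments):
        row = [1.0] + list(p)            # the (1|A) row
        for j in range(n + 1):
            B[c][j] += row[j]            # aggregation by key
    # C <- sums divided by counts; watch out for empty clusters (count 0),
    # which would otherwise produce NaNs, as the thread warns
    return [[s / B[c][0] for s in B[c][1:]] if B[c][0] else None
            for c in range(k)]

points = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
print(update_centroids(points, [0, 0, 1], 2))  # -> [[2.0, 3.0], [10.0, 10.0]]
```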
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable,
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable for the purpose of this,
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapblock etc.
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
Dmitriy Lyubimov
2017-04-21 23:50:42 UTC
Permalink
There appears to be a bug in Spark transposition operator w.r.t.
aggregating semantics which appears in cases where the same cluster (key)
is present more than once in the same block. The fix is one character long
(+ better test for aggregation).
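A minimal sketch of the aggregating semantics the fix restores (illustrative plain Python, not the actual Mahout/Spark code): when the same row key occurs more than once, even within one block, an aggregating transpose must sum the contributions rather than overwrite them.

```python
# Aggregating transpose over (key, row) pairs: rows sharing a key are
# summed element-wise. The bug class described above corresponds to
# overwriting (acc[j] = v) instead of accumulating (acc[j] += v) when a
# key repeats inside a single block.

def agg_transpose(keyed_rows, ncol):
    cols = {}
    for key, row in keyed_rows:
        acc = cols.setdefault(key, [0.0] * ncol)
        for j, v in enumerate(row):
            acc[j] += v
    return cols

rows = [(1, [1.0, 2.0]), (1, [3.0, 4.0])]   # duplicate key 1
print(agg_transpose(rows, 2))  # -> {1: [4.0, 6.0]}
```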



On Fri, Apr 21, 2017 at 1:06 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
One is the cluster ID of the index to which the data point should be
assigned, as per what is given in chapter 4 of this book,
Apache-Mahout-Mapreduce-Dmitriy-Lyubimov
<http://www.amazon.in/Apache-Mahout-Mapreduce-Dmitriy-Lyubimov/dp/1523775785>,
about the aggregating transpose.
From what i have understood, rows having the same key will be added when
we take the aggregating transpose of the matrix.
So i think there should be a way to assign new values to row keys, and i
think Dmitriy has also mentioned the same thing in the approach he has
outlined in this mail chain.
Correct me if i am wrong.
Thanks
Parth Khatwani
Post by Trevor Grant
Got it- in short no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1.
What are you trying to achieve by creating keys of 1?
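Trevor's dictionary analogy can be illustrated with a plain-Python sketch (illustrative only, not Mahout's actual collect implementation): four rows all keyed 1 can only occupy one slot of the collected matrix, so the other rows come back empty.

```python
# Collecting keyed rows into a dense in-core matrix: the key selects the
# physical target row, like a dict/HashMap key. Rows sharing a key land
# on the same slot (last write wins in this sketch; exact semantics
# depend on the backend), and unkeyed slots stay empty.

def collect(keyed_rows, nrow, ncol):
    dense = [[0.0] * ncol for _ in range(nrow)]
    for key, row in keyed_rows:
        dense[key] = list(row)       # same key -> same target row
    return dense

rows = [(1, [1.0, 1.0, 1.0, 3.0]), (1, [1.0, 2.0, 3.0, 4.0]),
        (1, [1.0, 3.0, 4.0, 5.0]), (1, [1.0, 4.0, 5.0, 6.0])]
print(collect(rows, 4, 4))
# rows 0, 2, 3 stay empty; row 1 holds whichever keyed row arrived last
```

This matches the output Parth observed: `{0 => {}, 1 => {0:1.0,1:4.0,2:5.0,3:6.0}, 2 => {}, 3 => {}}`.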
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor
I was trying to write the "*Kmeans*" using Mahout DRM as per the
algorithm outlined by Dmitriy.
I was facing the problem of assigning cluster ids to the row keys.
For example, consider the below matrix, where columns 1 to 3 are the data
points and column 0 contains the count of the point:
{
 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now, after calculating the centroid closest to each data point, i am
trying to assign the centroid index to the *row key*.
Suppose that every data point is assigned to the centroid at index 1, so
i assign key=1 to each and every row using the code below:

val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form:
{
 1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this:
{
 0 => {}
 1 => {0:1.0,1:4.0,2:5.0,3:6.0}
 2 => {}
 3 => {}
}
I am confused whether assigning new key values to the row index is done
through the following code line
keys(row) = 1
or whether there is another way.
I am not able to find any useful links or references on the internet;
even Andrew and Dmitriy's book does not have a proper reference for the
above mentioned issue.
Thanks & Regards
Parth Khatwani
On Fri, Apr 21, 2017 at 10:06 PM, Trevor Grant <
Post by Trevor Grant
OK, i dug into this before i read your question carefully, that was my
bad.

You want
{
 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
 0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

Here we are creating an empty column, then filling it with the row sums.
A distributed rowSums fn would be nice for just such an occasion... sigh

Let me know if that gets you going again. That was simpler than i
thought- sorry for the delay on this.

PS
Candidly, i didn't explore further once i understood the question, but
if you are going to collect this to the driver anyway (not sure if that
is the case)
A(::, 1 until 4).rowSums would also work.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
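The cbind-then-fill pattern Trevor uses can be sketched in plain Python (illustrative names; in Samsara the same thing runs per block inside `mapBlock`):

```python
# Append a zero column to the data columns (the cbind 0.0 step), then
# fill that column with each row's sum in a single pass over the block.
# Slicing out the last column yields the per-row sums Trevor expects.

def with_row_sums(block):
    out = []
    for row in block:
        row = list(row) + [0.0]      # cbind 0.0
        row[-1] = sum(row[:-1])      # fill with the row sum
        out.append(row)
    return out

A = [[1.0, 1.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
sums = [r[-1] for r in with_row_sums(A)]
print(sums)  # -> [5.0, 9.0, 12.0, 15.0]
```

These are exactly the values in Trevor's corrected expected output (5.0, 9.0, 12.0, 15.0).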
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the
complete data file
<https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the Github branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala>
is the complete code.
I also have made a sample program just to test assigning new values to
the keys of a row matrix and the aggregating transpose. I think these
two operations are causing the main problem in the Kmeans code.
Following is the link to the Github repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains hard-coded data. Following are the expected and
the actual output of the above code.
The output of the 1st println, after new cluster assignment, should be
this:
{
 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count and columns
1, 2 and 3 contain data.)
But it turns out to be this:
{
 0 => {}
 1 => {0:1.0,1:4.0,2:5.0,3:6.0}
 2 => {}
 3 => {}
}
And the result of the aggregating transpose should be:
{
 0 => {1: 4.0}
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}
Thanks Trevor for such great help.
Best Regards
Parth
On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant <
Post by Trevor Grant
Hey,
Sorry for the delay- was getting ready to tear into this.
Would you mind posting a small sample of data that you would expect this
application to consume?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with
it? I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew
I have tried testing the row key assignment issue which i have mentioned
in the above mail, by writing separate code where i am assigning a
default value of 1 to each row key of the DRM and then taking the
aggregating transpose.
I have committed the separate test code to the Github branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)
// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println, after new cluster assignment, should be
this:
{
 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
 2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
 3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(Here the zeroth column is used to store the centroid count and columns
1, 2 and 3 contain data.)
But it turns out to be this:
{
 0 => {}
 1 => {0:1.0,1:4.0,2:5.0,3:6.0}
 2 => {}
 3 => {}
}
And the result of the aggregating transpose should be:
{
 0 => {1: 4.0}
 1 => {1: 9.0}
 2 => {1: 12.0}
 3 => {1: 15.0}
}
I have referred to the book written by Andrew and Dmitriy, Apache
Mahout: Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely
there, but i am unable to find any example where row keys are assigned
new values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not
contain any such examples.
It would be great if i could get some reference to a solution of the
mentioned issue.
Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
Parth and Dmitriy,
This is awesome- as a follow on, can we work on getting this rolled in
to the algorithms framework?
Happy to work with you on this Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the
effect should be idempotent if a task is retried, which might be a
problem in a specific scenario: the object tree coming out of the block
cache can stay there and be retried again. But specifically w.r.t. this
key assignment i don't see any problem, since the action obviously would
be idempotent even if this code is run multiple times on the same
(key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir,
In the K means code above I think i am doing the following incorrectly:
assigning the closest centroid index to the Row Keys of the DRM.

//11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate
the initial centriods
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)

      //12. findTheClosestCentriod finds the closest centriod to the data
      // point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

in step 12 i am finding the centroid closest to the current dataPoint.
in step 13 i am assigning the closesetIndex to the key of the corresponding
row represented by the dataPoint.
I think i am doing step 13 incorrectly.
Also i am unable to find a proper reference for the same in the reference
links which you have mentioned above.

Thanks & Regards
Parth Khatwani
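The nearest-centroid search in step 12 is independent of the Samsara API. As a minimal plain-Python sketch of the same computation (the function name `find_closest` and the toy data are illustrative only, not part of the thread's Scala code):

```python
import math

def find_closest(point, centroids):
    # naive search: return the index of the centroid with the smallest
    # Euclidean distance to `point` (what step 12 computes)
    best_index, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, c)))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

centroids = [[0.0, 0.0], [10.0, 10.0]]
print(find_closest([9.0, 8.0], centroids))  # -> 1
```

The step 13 question (writing that index back into the block's row key) is the Samsara-specific part; the search itself is just this loop.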
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Dmitriy Sir,
I have created a github branch having the initial Kmeans code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>

Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

can't say i can read this code well formatted that way...

it would seem to me that the code is not using the broadcast variable and
instead is using the closure variable. that's the only thing i can
immediately see by looking in the middle of it.

it would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.

-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm you have outlined
above.
My code is as follows.
This code works fine till step number 10.
In step 11 i am assigning the new centroid index to the corresponding row
key of the data point in the matrix.
I think i am doing something wrong in step 11, maybe i am using incorrect
syntax.
Can you help me find out what i am doing wrong?

//start of main method
def main(args: Array[String]) {

  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  //3. convert data read in as string in to array of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 in the array of double; this will create
  // something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert rdd of array of double in rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert rdd to DrmRdd
  val rddMatrixLike: DrmRdd[Int] =
    rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert DrmRdd to CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. separating the column having all ones created in step 4; will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. Sampling to select initial centriods
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. Broadcasting the initial centriods
  val broadCastMatrix = drmBroadcast(centriods)

  //11. Iterating over the Data Matrix (in DrmLike[Int] format) to
  // calculate the initial centriods
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        //12. findTheClosestCentriod finds the closest centriod to the data
        // point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        //13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. Calculating the (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. Aggregating Transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15 bTranspose will have data in the following format:
  /* (n+1) x K, where n = dimension of the data point, K = number of clusters;
   * the zeroth row will contain the count of points assigned to each cluster
   * (assuming 3d data points)
   */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data point by count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. separating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till the convergence criteria is met
} //end of main method

// method to find the closest centriod to the data point "vec"
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(squaredSum)
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}

Thanks & Regards
Parth Khatwani
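As a cross-check of what steps 11 through 18 are meant to compute (assign each point, accumulate per-cluster counts and coordinate sums, then divide), here is a hedged plain-Python model of a single iteration; it mirrors only the arithmetic, not the Samsara API, and all names are illustrative:

```python
def kmeans_iteration(points, centroids):
    # one k-means update: assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its assigned points
    k, dim = len(centroids), len(points[0])
    counts = [0] * k                         # role of the (1|D)' zeroth row
    sums = [[0.0] * dim for _ in range(k)]   # per-cluster coordinate sums
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # step 17: divide sums by counts to obtain the new centroids
    return [[s / c for s in row] if c else row
            for row, c in zip(sums, counts)]

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
print(kmeans_iteration(points, [[0.0, 1.0], [10.0, 11.0]]))
# -> [[0.0, 1.0], [10.0, 11.0]]  (this toy data is already converged)
```

Comparing a Samsara run against a small model like this on the same toy data is one way to localize which of the steps above goes wrong.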
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <

---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make
sure it is committed to the spark cache (by using the checkpoint api), if
spark is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
here is the outline. For details of the APIs, please refer to the samsara
manual [2], i will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignments and recomputing
the centroid matrix C till convergence, based on some test or simply
limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute centroid indexes for each row in A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices
to keys of A using the operator mapblock() (details in [2], [3], [4]). You
also need to broadcast C in order to be able to access it in an efficient
manner inside the mapblock() closure. Examples of that are plenty given in
[2]. Essentially, in mapblock, you'd reform the row keys to reflect the
cluster index in C. While going over A, you'd have a "nearest neighbor"
problem to solve for the row of A and the centroids C. This is the bulk of
the computation really, and there are a few tricks there that can speed
this step up in both exact and approximate manner, but you can start with a
naive search.

Once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute essentially the average of the
rows of A grouped by the centroid key. The trick is to do a computation of
(1|A)' which will result in a matrix of the shape (counts/sums of cluster
rows). This is the part i find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression

  (1 cbind A).t

(again, see [2]). So when you compute, say,

  B = (1 | A)',

then B is (n+1) x k, so each column contains a vector
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
corresponding
Post by KHATWANI PARTH BHARAT
to
Post by KHATWANI PARTH BHARAT
a
Post by Dmitriy Lyubimov
cluster 1..k. In such column, the first element
would
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
be #
Post by KHATWANI PARTH BHARAT
of
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
points in
Post by KHATWANI PARTH BHARAT
the
Post by Dmitriy Lyubimov
cluster, and the rest of it would correspond to
sum
Post by KHATWANI PARTH BHARAT
of
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
all
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
points.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
So
Post by KHATWANI PARTH BHARAT
in
Post by Dmitriy Lyubimov
order to arrive to an updated matrix C, we need
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
collect
Post by KHATWANI PARTH BHARAT
B
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
into
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
memory,
Post by Dmitriy Lyubimov
and slice out counters (first row) from the rest
of
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
it.
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
C <- B (2:,:) each row divided by B(1,:)
(watch out for empty clusters with 0 elements,
this
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
will
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
cause
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
lack
Post by KHATWANI PARTH BHARAT
of
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
convergence and NaNs in the newly computed C).
This operation obviously uses subblocking and
row-wise
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
iteration
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
over
Post by KHATWANI PARTH BHARAT
B,
Post by Dmitriy Lyubimov
for which i am again making reference to [2].
[1] https://github.com/apache/
mahout/blob/master/math-scala/
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
src/main/scala/org/apache/maho
ut/math/drm/package.scala#
Post by KHATWANI PARTH BHARAT
L149
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[2], Sasmara manual, a bit dated but viable,
http://apache.github
Post by KHATWANI PARTH BHARAT
.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable
for
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
the
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
purpose of
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
this
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
http://apache.github.io/
mahout/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
math-
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
scala/index.htm
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[4] mapblock etc.
http://apache.github.io/mahout
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
math-scala/index.html#org.
apache.mahout.math.drm.
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
RLikeDrmOps
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH
BHARAT <
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
@Dmitriycan you please again tell me the
approach
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
move
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
ahead.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI
PARTH
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT <
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
yes i am unable to figure out the way ahead.
Like how to create the augmented matrix A :=
(0|D)
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
which
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
you
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
have
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
mentioned.
On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy
Lyubimov
Post by Trevor Grant
<
been a
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
bit
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
confusing?
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI
PARTH
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT
Post by KHATWANI PARTH BHARAT
<
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the kmeans clustering
algorithm
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
using
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Mahout
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Samsara
Post by KHATWANI PARTH BHARAT
but i am bit confused
about how to leverage Distributed Row
Matrix
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
for
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
same.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Can
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
anybody
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
help
Post by KHATWANI PARTH BHARAT
me with same.
Thanks
Parth Khatwani
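The counts-and-sums trick above can be sketched outside of Samsara. The following is a plain-Python analogue (an illustration only, not Mahout code) of forming (1|A)', aggregating by cluster key, and dividing sums by counts to get updated centroids, including the empty-cluster guard Dmitriy warns about:

```python
# Plain-Python analogue of the (1|A)' trick: each row of A carries its
# assigned cluster key; aggregating by key yields, per cluster j, the
# pair (count_j | sum_j), from which the new centroid is sum_j / count_j.

def update_centroids(rows, keys, k):
    """rows: list of data points; keys: cluster index per row; k: #clusters."""
    dim = len(rows[0])
    # b has k columns; row 0 holds counts, rows 1.. hold coordinate sums
    b = [[0.0] * k for _ in range(dim + 1)]
    for key, row in zip(keys, rows):
        b[0][key] += 1.0                 # the prepended "1" becomes a count
        for i, x in enumerate(row):
            b[i + 1][key] += x           # aggregated sums per cluster
    # C <- B(2:,:) divided row-wise by B(1,:); guard empty clusters
    centroids = []
    for j in range(k):
        n = b[0][j]
        if n == 0:                       # empty cluster -> division would give NaN
            centroids.append(None)
        else:
            centroids.append([b[i + 1][j] / n for i in range(dim)])
    return centroids

points = [[1.0, 1.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
keys = [0, 1, 1, 1]
print(update_centroids(points, keys, 2))
# -> [[1.0, 1.0], [3.0, 4.0]]
```

In Samsara itself the equivalent of the loop is the distributed expression `(1 cbind A).t` followed by a collect and a row-wise division, as described in the message above.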
KHATWANI PARTH BHARAT
2017-04-22 04:25:11 UTC
Permalink
@Dmitriy
I didn't get this: "The fix is one character long (+ better test for
aggregation)."
And even before the aggregating transpose, I am trying to assign cluster
IDs to the row keys, which doesn't seem to work.

I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this when i assign 1 to each and every row key
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}

From what i have understood, even before doing the aggregating transpose
the matrix should be in the following format
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
only then will rows with the same key be added.

Correct me if i am wrong.

Thanks
Parth Khatwani
Post by Dmitriy Lyubimov
There appears to be a bug in Spark transposition operator w.r.t.
aggregating semantics which appears in cases where the same cluster (key)
is present more than once in the same block. The fix is one character long
(+ better test for aggregation).
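For reference, the aggregation semantics the transpose is expected to have can be sketched in plain Python (a hedged illustration, not the actual Spark operator): columns that share a row key must be summed, not overwritten, even when duplicate keys land in the same block:

```python
# Sketch of aggregating-transpose semantics: rows sharing a key are
# summed column-wise, so the transpose's column for key k holds the sum
# of all rows that were assigned key k.

def agg_transpose(keyed_rows):
    """keyed_rows: list of (key, row). Returns dict col -> {key: sum}."""
    out = {}
    for key, row in keyed_rows:
        for col, val in enumerate(row):
            out.setdefault(col, {})
            # aggregate, don't overwrite, when a key repeats
            out[col][key] = out[col].get(key, 0.0) + val
    return out

rows = [(1, [1.0, 2.0]), (1, [3.0, 4.0]), (0, [5.0, 6.0])]
print(agg_transpose(rows))
# column 0 -> {1: 4.0, 0: 5.0}, column 1 -> {1: 6.0, 0: 6.0}
```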
On Fri, Apr 21, 2017 at 1:06 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
One is the cluster ID of the index to which the data point should be
assigned.
As per what is given in chapter 4 of this book, Apache-Mahout-Mapreduce-Dmitriy-Lyubimov
<http://www.amazon.in/Apache-Mahout-Mapreduce-Dmitriy-Lyubimov/dp/1523775785>,
about the aggregating transpose:
from what i have understood, rows having the same key will be added when
we take the aggregating transpose of the matrix.
So i think there should be a way to assign new values to row keys, and i
think Dmitriy has also mentioned the same thing in the approach he has
outlined in this mail chain.
Correct me if i am wrong.

Thanks
Parth Khatwani
Post by Trevor Grant
Got it- in short no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1.
What are you trying to achieve by creating keys of 1?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
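Trevor's HashMap analogy can be sketched in a few lines of plain Python (an illustration, not Mahout internals): if keys behave like map keys, re-keying every row to 1 means the last write wins and the other key slots come back empty, which matches the output Parth observed:

```python
# Four rows keyed 0..3, as in the thread's example matrix (data columns only).
rows = {0: [1.0, 1.0], 1: [2.0, 3.0], 2: [3.0, 4.0], 3: [4.0, 5.0]}

# Key slots 0..3, like the DRM's row index space after collect.
collected = {k: [] for k in rows}
for _key, row in rows.items():
    collected[1] = row  # every row re-keyed to 1: last write wins

print(collected)
# -> {0: [], 1: [4.0, 5.0], 2: [], 3: []}
```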
On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor
I was trying to write "*Kmeans*" using Mahout DRM as per the algorithm
outlined by Dmitriy.
I was facing the problem of assigning cluster IDs to the row keys.
For example, consider the below matrix, where columns 1 to 3 are the data
points and column 0 contains the count of the point
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now, after calculating the centroid closest to the data point, i am
trying to assign the centroid index to the *row key*.
Suppose that every data point is assigned to the centroid at index 1,
so after assigning the key 1 to each and every row
using the code below

val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
I am confused whether assigning new values to the row keys is done
through the line

keys(row) = 1

or whether there is some other way.
I am not able to find any useful links or references on the internet;
even Andrew and Dmitriy's book does not have any proper reference for the
above mentioned issue.

Thanks & Regards
Parth Khatwani
On Fri, Apr 21, 2017 at 10:06 PM, Trevor Grant <
Post by Trevor Grant
OK, i dug into this before i read your question carefully, that was my
bad.

{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

Here we are creating an empty column, then filling it with the row sums.
A distributed rowSums fn would be nice for just such an occasion... sigh

Let me know if that gets you going again. That was simpler than I
thought- sorry for delay on this.

PS
Candidly, I didn't explore further once i understood the question, but if
you are going to collect this to the driver anyway (not sure if that is
the case)
A(::, 1 until 4).rowSums would also work.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the
complete data file
<https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the Github branch for the code
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala>
is the complete code.

I also have made a sample program just to test assigning new values to
the keys of a row matrix and the aggregating transpose. I think assigning
new values to the keys of a row matrix and the aggregating transpose are
causing the main problem in the Kmeans code.
Following is the link to the Github repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains the hard coded data. Following is the expected
and the actual output of the above code.

The output of the 1st println after new cluster assignment should be this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centroid count and columns
1, 2 and 3 contain data)
but it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

Thanks Trevor for such a great help

Best Regards
Parth
On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant <
Post by Trevor Grant
Hey

Sorry for delay- was getting ready to tear into this.

Would you mind posting a small sample of data that you would expect this
application to consume.

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem. Can you please help me out with
it?
I am unable to find the proper reference to solve the above issue.

Thanks & Regards
Parth Khatwani

On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy,
@Trevor and @Andrew
I have tried testing this row key assignment issue which i have mentioned
in the above mail, by writing a separate code where i am assigning a
default value 1 to each row key of the DRM and then taking the
aggregating transpose.
I have committed the separate test code to the Github branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.

The code is as follows

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)
// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println after new cluster assignment should be this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centroid count and columns
1, 2 and 3 contain data)
but it turns out to be this
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

I have referred to the book written by Andrew and Dmitriy, Apache Mahout:
Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely
there, but i am unable to find any example where row keys are assigned
new values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not
contain any such examples.
It will be great if i can get some reference to a solution of the
mentioned issue.

Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"

Parth and Dmitriy,
This is awesome- as a follow on can we work on getting this rolled in to
the algorithms framework?
Happy to work with you on this Parth!

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the
effect should be idempotent if a task is retried, which might be a
problem in a specific scenario of the object tree coming out from the
block cache, which can stay there and be retried again. but specifically
w.r.t. this key assignment i don't see any problem, since the action
obviously would be idempotent even if this code is run multiple times on
the same (key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the K means code above I think i am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM

// 11. Iterating over the Data Matrix (in DrmLike[Int] format) to
//     calculate the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      // 12. findTheClosestCentriod finds the centroid closest to the
      //     data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      // 13. assigning the closest index to the key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 i am finding the centroid closest to the current dataPoint.
In step 13 i am assigning the closesetIndex to the key of the
corresponding row represented by the dataPoint.
I think i am doing step 13 incorrectly.
Also i am unable to find a proper reference for the same in the
reference links which you have mentioned above.

Thanks & Regards
Parth Khatwani
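The nearest-centroid step (step 12 above) can be done with a naive search, as Dmitriy suggested earlier in the thread. Below is a plain-Python sketch; the thread does not show the source of the `findTheClosestCentriod` helper, so this is an assumed equivalent, not the actual code:

```python
# Naive nearest-centroid search (assumed stand-in for the thread's
# findTheClosestCentriod helper): return the index of the centroid with
# the smallest squared Euclidean distance to the given point.

def find_closest_centroid(point, centroids):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

centroids = [[0.0, 0.0], [5.0, 5.0]]
print(find_closest_centroid([4.0, 4.5], centroids))  # -> 1
```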
Dmitriy Sir,
I have created a Github branch having the initial Kmeans code
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>

Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

Can't say I can read this code well formatted that way...

It would seem to me that the code is not using the broadcast variable and
instead is using a closure variable; that's the only thing I can
immediately see by looking in the middle of it.

It would be better if you created a branch on GitHub for that code; that
would allow for easy check-outs and comments.

-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT wrote:

@Dmitriy Sir,
I have completed the KMeans code as per the algorithm you have outlined
above. My code is as follows.

This code works fine till step number 10. In step 11 I am assigning the
new centroid index to the corresponding row key of the data point in the
matrix. I think I am doing something wrong in step 11; maybe I am using
incorrect syntax. Can you help me find out what I am doing wrong?
//start of main method
def main(args: Array[String]) {

  //1. initialize the spark and mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
         "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the rdd
  val lines = sc.textFile(args(1))

  //3. convert data read in as string into array of double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 in the array of double; this will create
  //   something like (1 | D), which will be used while calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert rdd of array of double into rdd of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert rdd to DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert DrmRdd to CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. separating the column having all ones created in step 4; will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. Sampling to select initial centriods
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. Broad Casting the initial centriods
  val broadCastMatrix = drmBroadcast(centriods)

  //11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate
  //    the initial centriods
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        var dataPoint = block(row, ::)

        //12. findTheClosestCentriod finds the closest centriod to the
        //    data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        //13. assigning closest index to key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. Calculating the (1|D)
  val b = (oneVector cbind dataDrmX)

  //15. Aggregating Transpose (1|D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  // after step 15 bTranspose will have data in the following format:
  /* (n+1)*K, where n = dimension of the data point, K = number of clusters;
   * the zeroth row will contain the count of points assigned to each cluster
   * (assuming 3d data points) */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data point by count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. separating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till convergence criteria is met
} //end of main method

// method to find the closest centriod to the data point
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(squaredSum)
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

// calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

// method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}
Thanks & Regards
Parth Khatwani

On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT wrote:
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.

ps2 since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint api), if
Spark is used.
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov wrote:

Here is the outline. For details of the APIs, please refer to the Samsara
manual [2]; I will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity,
let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignments and
recomputing the centroid matrix C till convergence, based on some test or
simply limited by an epoch count budget, your choice.
Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute the
index, we put it into the row key. You can do that by assigning centroid
indices to the keys of A using the operator mapBlock() (details in [2],
[3], [4]). You also need to broadcast C in order to be able to access it
in an efficient manner inside the mapBlock() closure. Examples of that
are plenty given in [2].

Essentially, in mapBlock, you'd reform the row keys to reflect the
cluster index in C. While going over A, you'd have a "nearest neighbor"
problem to solve for the row of A and the centroids C. This is the bulk
of the computation really, and there are a few tricks there that can
speed this step up in both exact and approximate manner, but you can
start with a naive search.
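[Editor's sketch] The naive nearest-centroid search described above can be sketched in plain Scala, with dense Array[Double] rows standing in for Mahout vectors; the object and method names here are illustrative, not Mahout API:

```scala
object NearestCentroid {
  // squared Euclidean distance between two dense points
  def distSq(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // index of the centroid in `centroids` closest to `point` (naive linear scan)
  def closest(point: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => distSq(point, centroids(i)))
}
```

For example, with centroids (0,0) and (10,10), the point (1,2) maps to index 0 and (9,8) to index 1; this linear scan is the O(k) step per row that the approximate tricks mentioned above would speed up.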
Once you have assigned centroids to the keys of matrix A, you'd want to
do an aggregating transpose of A to compute, essentially, the average of
the rows of A grouped by the centroid key. The trick is to do a
computation of (1|A)', which will result in a matrix of the shape
(counts / sums of cluster rows). This is the part I find difficult to
explain without latex graphics.
In Samsara, construction of (1|A)' corresponds to the DRM expression

(1 cbind A).t (again, see [2]).
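[Editor's note] The counts-and-sums structure that (1|A)' produces can be made concrete with a tiny worked example (the numbers are illustrative, not from the thread): take three 2-d points whose row keys already hold their cluster assignments, with k = 2.

```latex
% Points A = [1 1; 3 3; 10 10], row keys (cluster assignments) 1, 1, 2.
% The aggregating transpose of (1|A) sums the columns sharing a key:
B = (1\,|\,A)' =
\begin{pmatrix}
2 & 1 \\
4 & 10 \\
4 & 10
\end{pmatrix}
% Column j holds the count of cluster-j points on top and the coordinate
% sums below; dividing each column's lower entries by its top entry gives
% the updated centroids (2, 2) and (10, 10).
```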
So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element would be the number of
points in the cluster, and the rest of it would correspond to the sum of
all the points. So in order to arrive at an updated matrix C, we need to
collect B into memory, and slice out the counters (first row) from the
rest of it:

C <- B(2:,:), each row divided by B(1,:)

(watch out for empty clusters with 0 elements; this will cause lack of
convergence and NaNs in the newly computed C).
This operation obviously uses subblocking and
row-wise
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
iteration
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
over
Post by KHATWANI PARTH BHARAT
B,
Post by Dmitriy Lyubimov
for which i am again making reference to [2].
[1] https://github.com/apache/
mahout/blob/master/math-scala/
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
src/main/scala/org/apache/maho
ut/math/drm/package.scala#
Post by KHATWANI PARTH BHARAT
L149
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[2], Sasmara manual, a bit dated but viable,
http://apache.github
Post by KHATWANI PARTH BHARAT
.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely viable
for
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
the
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
purpose of
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
this
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
http://apache.github.io/
mahout/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
math-
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
scala/index.htm
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[4] mapblock etc.
http://apache.github.io/mahout
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
math-scala/index.html#org.
apache.mahout.math.drm.
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
RLikeDrmOps
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI
PARTH
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT <
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
@Dmitriycan you please again tell me the
approach
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
move
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
ahead.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI
PARTH
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT <
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
yes i am unable to figure out the way
ahead.
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Like how to create the augmented matrix A
:=
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
(0|D)
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
which
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
you
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
have
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
mentioned.
On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy
Lyubimov
Post by Trevor Grant
<
been a
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
bit
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
confusing?
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI
PARTH
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT
Post by KHATWANI PARTH BHARAT
<
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the kmeans
clustering
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
algorithm
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
using
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Mahout
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Samsara
Post by KHATWANI PARTH BHARAT
but i am bit confused
about how to leverage Distributed Row
Matrix
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
for
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
same.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Can
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
anybody
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
help
Post by KHATWANI PARTH BHARAT
me with same.
Thanks
Parth Khatwani
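The update step Dmitriy describes (divide the per-cluster sums by the per-cluster counts, guarding against empty clusters) can be simulated in plain Python. This is a sketch of the arithmetic only, not the Mahout API; the matrix B below is a hypothetical stand-in for the collected (1|A)' result.

```python
# B is the collected aggregate: row 0 holds per-cluster point counts,
# rows 1.. hold the per-cluster coordinate sums (one column per cluster).
B = [
    [2.0, 0.0, 1.0],   # counts for clusters 0..2 (cluster 1 is empty)
    [4.0, 0.0, 3.0],   # sum of coordinate 0 per cluster
    [6.0, 0.0, 5.0],   # sum of coordinate 1 per cluster
]

def updated_centroids(B):
    counts, sums = B[0], B[1:]
    k = len(counts)
    # Guard against empty clusters: dividing by a 0 count would produce
    # NaNs and stall convergence, exactly as warned above.
    return [[row[j] / counts[j] if counts[j] > 0 else 0.0 for j in range(k)]
            for row in sums]

C = updated_centroids(B)
print(C)  # [[2.0, 0.0, 3.0], [3.0, 0.0, 5.0]]
```

Each column of C is then the new centroid for that cluster; the empty cluster is left at zero here, though in practice one would re-seed it.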
Trevor Grant
2017-04-22 18:42:32 UTC
Permalink
In short Khatwani, you found a bug!

They creep in from time to time. Thank you and sorry for the inconvenience.

You'll find
https://issues.apache.org/jira/browse/MAHOUT-1971

and subsequent PR
https://github.com/apache/mahout/pull/307/files

addressing this issue.

Wait for these to close, and then try building Mahout again with mvn clean install.

Your code will hopefully work then.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*


On Fri, Apr 21, 2017 at 11:25 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy
I didn't get this: "The fix is one character long (+ better test for aggregation)."
And even before the aggregating transpose, I am trying to assign cluster IDs to the row keys, which doesn't seem to work.
I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this when I assign 1 to each and every row key:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
From what I have understood, even before doing the aggregating transpose the matrix should be in the first format above; only then will rows with the same key be added.
Correct me if I am wrong.
Thanks
Parth Khatwani
Post by Dmitriy Lyubimov
There appears to be a bug in the Spark transposition operator w.r.t. aggregating semantics, which appears in cases where the same cluster (key) is present more than once in the same block. The fix is one character long (+ a better test for aggregation).
On Fri, Apr 21, 2017 at 1:06 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
One is the cluster ID of the index to which the data point should be assigned.
As per what is given in chapter 4 of the book Apache-Mahout-Mapreduce-Dmitriy-Lyubimov
<http://www.amazon.in/Apache-Mahout-Mapreduce-Dmitriy-Lyubimov/dp/1523775785>
about the aggregating transpose:
from what I have understood, rows having the same key will be added when we take the aggregating transpose of the matrix.
So I think there should be a way to assign new values to row keys, and I think Dmitriy has also mentioned the same thing in the approach he has outlined in this mail chain.
Correct me if I am wrong.
Thanks
Parth Khatwani
On Sat, Apr 22, 2017 at 1:54 AM, Trevor Grant <
Post by Trevor Grant
Got it. In short, no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1.
What are you trying to achieve by creating keys of 1?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor
I was trying to write the *Kmeans* using a Mahout DRM as per the algorithm outlined by Dmitriy.
I was facing the problem of assigning cluster IDs to the row keys.
For example, consider the below matrix, where columns 1 to 3 are the data points and column 0 contains the count of the point:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now, after calculating the centroid closest to the data point, I am trying to assign the centroid index to the *row key*.
Suppose that every data point is assigned to the centroid at index 1, so I assign key=1 to each and every row using the code below:

val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
I am confused whether assigning the new key values to the row index is done through the following code line

keys(row) = 1

or whether there is another way.
I am not able to find any useful links or references on the internet; even Andrew and Dmitriy's book does not have any proper reference for the above-mentioned issue.
Thanks & Regards
Parth Khatwani
On Fri, Apr 21, 2017 at 10:06 PM, Trevor Grant <
Post by Trevor Grant
OK, I dug into this before I read your question carefully; that was my bad.
You want
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

where we are creating an empty column (the cbind 0.0), then filling it with the row sums. A distributed rowSums fn would be nice for just such an occasion... sigh.
Let me know if that gets you going again. That was simpler than I thought; sorry for the delay on this.
PS
Candidly, I didn't explore further once I understood the question, but if you are going to collect this to the driver anyway (not sure if that is the case), A(::, 1 until 4).rowSums would also work.
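The arithmetic Trevor sketches (drop the count column, then sum each row) can be checked with plain Python lists. The slicing below mirrors A(::, 1 until 4) followed by a row-wise sum; it is an illustration of the computation, not the Samsara API.

```python
A = [
    [1.0, 1.0, 1.0, 3.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 3.0, 4.0, 5.0],
    [1.0, 4.0, 5.0, 6.0],
]

# Sum columns 1..3 of each row, i.e. drop the count column first.
row_sums = [sum(row[1:4]) for row in A]
print(row_sums)  # [5.0, 9.0, 12.0, 15.0]
```

This also confirms Trevor's correction: the first row sums to 5.0, not the 4.0 in the earlier expected output.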
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the complete data file
<https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the GitHub branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala> is the complete code.
I have also made a sample program just to test assigning new values to the keys of a row matrix and the aggregating transpose. I think assigning new values to the keys of a row matrix and the aggregating transpose are causing the main problem in the Kmeans code.
Following is the link to the GitHub repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains hard-coded data. Following are the expected and the actual output of the above code.
The output of the 1st println, after new cluster assignment, should be this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centroid count and columns 1, 2 and 3 contain data),
but it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
Thanks Trevor for such great help.
Best Regards
Parth
On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant <
Post by Trevor Grant
Hey,
Sorry for the delay; was getting ready to tear into this.
Would you mind posting a small sample of data that you would expect this application to consume?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem; can you please help me out with it?
I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew
I have tried testing this row key assignment issue, which I have mentioned in the above mail, by writing a separate code where I am assigning a default value 1 to each row key of the DRM and then taking the aggregating transpose.
I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)

// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println, after new cluster assignment, should be this
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centroid count and columns 1, 2 and 3 contain data),
but it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
I have referred to the book written by Andrew and Dmitriy, Apache Mahout: Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely there, but I am unable to find any example where row keys are assigned new values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
also does not contain any such examples.
It would be great if I could get some reference to a solution of the mentioned issue.
Thanks
Parth Khatwani
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Parth and Dmitriy,
This is awesome. As a follow-on, can we work on getting this rolled into the algorithms framework?
Happy to work with you on this Parth!
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
i would think reassigning keys should work in most cases.
The only exception is that technically Spark contracts imply that the effect should be idempotent if a task is retried, which might be a problem in a specific scenario where the object tree comes out of the block cache, stays there, and is retried again. But specifically w.r.t. this key assignment I don't see any problem, since the action obviously would be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
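Dmitriy's idempotency point can be checked with a toy simulation: rerunning the key assignment on the same block produces the same keys. Plain Python; the `closest` lambda is a made-up stand-in for a closest-centroid lookup, not a real metric.

```python
def assign_keys(keys, block, closest):
    # Overwrite each row key with the index of that row's closest centroid.
    for row in range(len(block)):
        keys[row] = closest(block[row])
    return keys

# Toy stand-in for a closest-centroid lookup (not a real distance function).
closest = lambda point: 0 if point[0] < 3 else 1

block = [[1.0], [2.0], [5.0]]
first_keys = assign_keys([9, 9, 9], block, closest)
second_keys = assign_keys(list(first_keys), block, closest)  # simulated task retry
print(first_keys == second_keys)  # True: a retried task computes the same keys
```

Because the assignment depends only on the block contents, running it once or many times over the same (key, block) pair yields identical results, which is what makes it safe under Spark's retry semantics.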
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir,
In the Kmeans code above, I think I am doing the following incorrectly:
assigning the closest centroid index to the row keys of the DRM.

// 11. Iterating over the data matrix (in DrmLike[Int] format) to calculate the initial centroids
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)
      // 12. findTheClosestCentriod finds the centroid closest to the data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
      // 13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint;
in step 13 I am assigning the closesetIndex to the key of the corresponding row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also, I am unable to find a proper reference for the same in the reference links which you have mentioned above.
Thanks & Regards
Parth Khatwani
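Steps 11-13 above can be sketched in plain Python. This is a simulation of the per-row logic only; findTheClosestCentriod is replaced here by a squared-Euclidean argmin, which is an assumption about its behaviour, not its actual implementation.

```python
def closest_centroid(point, centroids):
    # Index of the centroid with minimum squared Euclidean distance.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sqdist(point, centroids[i]))

centroids = [[1.0, 1.0], [4.0, 5.0]]
block = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
keys = [0, 1, 2]  # original row indices

# Steps 12-13: replace each row key with its closest centroid's index.
for row in range(len(block)):
    keys[row] = closest_centroid(block[row], centroids)
print(keys)  # [0, 1, 1]
```

After this pass, rows sharing a key belong to the same cluster, which is exactly what the aggregating transpose then exploits to sum them per cluster.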
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH
BHARAT
Post by Dmitriy Lyubimov
<
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Dmitriy Sir,
I have Created a github branch Github Branch Having
Initial
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Kmeans
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Code
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
<https://github.com/parth2691/
Spark_Mahout/tree/Dmitriy-Lyub
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
imov>
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
can't say i can read this code well formatted that way...

it would seem to me that the code is not using the broadcast variable and
instead is using a closure variable. that's the only thing i can
immediately see by looking in the middle of it.

it would be better if you created a branch on github for that code; that
would allow for easy check-outs and comments.

-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy Sir
I have completed the Kmeans code as per the algorithm you have outlined
above. My code is as follows.
This code works fine till step number 10.
In step 11 I am assigning the new centroid index to the corresponding row
key of the data point in the matrix.
I think I am doing something wrong in step 11; maybe I am using incorrect
syntax.
Can you help me find out what I am doing wrong?
//start of main method
def main(args: Array[String]) {
    //1. initialize the spark and mahout context
    val conf = new SparkConf()
      .setAppName("DRMExample")
      .setMaster(args(0))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator",
           "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
    implicit val sc = new SparkDistributedContext(new SparkContext(conf))

    //2. read the data file and save it in the rdd
    val lines = sc.textFile(args(1))

    //3. convert data read in as string in to array of double
    val test = lines.map(line => line.split('\t').map(_.toDouble))

    //4. add a column having value 1 in array of double; this will create
    //   something like (1 | D), which will be used while calculating (1 | D)'
    val augumentedArray = test.map(addCentriodColumn _)

    //5. convert rdd of array of double in rdd of DenseVector
    val rdd = augumentedArray.map(dvec(_))

    //6. convert rdd to DrmRdd
    val rddMatrixLike: DrmRdd[Int] =
      rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

    //7. convert DrmRdd to CheckpointedDrm[Int]
    val matrix = drmWrap(rddMatrixLike)

    //8. seperating the column having all ones created in step 4; will use it later
    val oneVector = matrix(::, 0 until 1)

    //9. final input data in DrmLike[Int] format
    val dataDrmX = matrix(::, 1 until 4)

    //9. Sampling to select initial centriods
    val centriods = drmSampleKRows(dataDrmX, 2, false)
    centriods.size

    //10. Broad Casting the initial centriods
    val broadCastMatrix = drmBroadcast(centriods)

    //11. Iterating over the Data Matrix (in DrmLike[Int] format) to
    //    calculate the initial centriods
    dataDrmX.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          var dataPoint = block(row, ::)

          //12. findTheClosestCentriod finds the closest centriod to the
          //    Data point specified by "dataPoint"
          val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

          //13. assigning closest index to key
          keys(row) = closesetIndex
        }
        keys -> block
    }

    //14. Calculating the (1|D)
    val b = (oneVector cbind dataDrmX)

    //15. Aggregating Transpose (1|D)'
    val bTranspose = (oneVector cbind dataDrmX).t
    // after step 15 bTranspose will have data in the following format
    /* (n+1)*K where n = dimension of the data point, K = number of clusters
     * zeroth row will contain the count of points assigned to each cluster
     * assuming 3d data points
     */

    val nrows = b.nrow.toInt

    //16. slicing the count vectors out
    val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
    val vectorSums = b(1 until nrows, ::)

    //17. dividing the data point by count vector
    vectorSums.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          block(row, ::) /= pointCountVectors
        }
        keys -> block
    }

    //18. seperating the count vectors
    val newCentriods = vectorSums.t(::, 1 until centriods.size)

    //19. iterate over the above code till convergence criteria is meet
}//end of main method

// method to find the closest centriod to the data point (Vector in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
    var index = 0
    var closest = Double.PositiveInfinity
    for (row <- 0 until matrix.nrow) {
      val tempDist = Math.sqrt(ssr(vec, matrix(row, ::)))
      if (tempDist < closest) {
        closest = tempDist
        index = row
      }
    }
    index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
    (a - b) ^= 2 sum
}

//method used to create (1|D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
    val newArr = new Array[Double](arg.length + 1)
    newArr(0) = 1.0
    for (i <- 0 until (arg.size)) {
      newArr(i + 1) = arg(i)
    }
    newArr
}

Thanks & Regards
Parth Khatwani
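[As a cross-check for steps 12-13 above: the search that findTheClosestCentriod performs is just a Euclidean argmin over the centroid rows. A minimal plain-numpy sketch of the same logic, independent of Mahout (the function and variable names below are illustrative only):]

```python
import numpy as np

def closest_centroid(point, centroids):
    # squared Euclidean distance from the point to each centroid row;
    # taking sqrt (as findTheClosestCentriod does) would not change the argmin
    dists = ((centroids - point) ** 2).sum(axis=1)
    return int(np.argmin(dists))

centroids = np.array([[0.0, 0.0, 0.0],
                      [10.0, 10.0, 10.0]])
print(closest_centroid(np.array([9.0, 8.0, 10.0]), centroids))  # -> 1
```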
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
ps1 this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make
sure it is committed to the spark cache (by using the checkpoint api), if
spark is used

On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
here is the outline. For details of APIs, please refer to the samsara
manual [2], i will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting the initial clusters, by
sampling them out. You can do that by using the sampling api [1], thus
forming a k x n in-memory matrix C (the current centroids). C is therefore
of Mahout's Matrix type.

You then proceed by alternating between cluster assignments and recomputing
the centroid matrix C till convergence, based on some test or simply
limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices
to the keys of A using the operator mapBlock() (details in [2], [3], [4]).
You also need to broadcast C in order to be able to access it in an
efficient manner inside the mapBlock() closure. Examples of that are
plentiful in [2]. Essentially, in mapBlock, you'd reform the row keys to
reflect the cluster index in C. While going over A, you'd have a "nearest
neighbor" problem to solve for each row of A and the centroids C. This is
the bulk of the computation really, and there are a few tricks there that
can speed this step up in both exact and approximate manner, but you can
start with a naive search.
once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute, essentially, the average of the
rows of A grouped by the centroid key. The trick is to do a computation of
(1|A)', which results in a matrix of the shape (counts | sums of cluster
rows). This is the part i find difficult to explain without latex graphics.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
In Samsara, construction of (1|A)'
corresponds
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
DRM
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
expression
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
(1 cbind A).t (again, see [2]).
So when you compute, say,
B = (1 | A)',
then B is (n+1) x k, so each column
contains a
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
vector
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
corresponding
Post by KHATWANI PARTH BHARAT
to
Post by KHATWANI PARTH BHARAT
a
Post by Dmitriy Lyubimov
cluster 1..k. In such column, the first
element
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
would
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
be #
Post by KHATWANI PARTH BHARAT
of
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
points in
Post by KHATWANI PARTH BHARAT
the
Post by Dmitriy Lyubimov
cluster, and the rest of it would correspond
to
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
sum
Post by KHATWANI PARTH BHARAT
of
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
all
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
points.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
So
Post by KHATWANI PARTH BHARAT
in
Post by Dmitriy Lyubimov
order to arrive to an updated matrix C, we
need
Post by KHATWANI PARTH BHARAT
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
collect
Post by KHATWANI PARTH BHARAT
B
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
into
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
memory,
Post by Dmitriy Lyubimov
and slice out counters (first row) from the
rest
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
of
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
it.
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
C <- B (2:,:) each row divided by B(1,:)
(watch out for empty clusters with 0
elements,
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
this
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
will
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
cause
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
lack
Post by KHATWANI PARTH BHARAT
of
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
convergence and NaNs in the newly computed
C).
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
This operation obviously uses subblocking
and
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
row-wise
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
iteration
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
over
Post by KHATWANI PARTH BHARAT
B,
Post by Dmitriy Lyubimov
for which i am again making reference to
[2].
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[1] https://github.com/apache/
mahout/blob/master/math-scala/
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
src/main/scala/org/apache/maho
ut/math/drm/package.scala#
Post by KHATWANI PARTH BHARAT
L149
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[2], Sasmara manual, a bit dated but viable,
http://apache.github
Post by KHATWANI PARTH BHARAT
.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again, dated but largely
viable
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
for
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
the
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
purpose of
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
this
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
http://apache.github.io/
mahout/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
math-
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
scala/index.htm
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
[4] mapblock etc.
http://apache.github.io/mahout
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
/0.10.1/docs/mahout-
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
math-scala/index.html#org.
apache.mahout.math.drm.
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
RLikeDrmOps
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI
PARTH
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT <
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
@Dmitriycan you please again tell me the
approach
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
move
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
ahead.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI
PARTH
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT <
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
yes i am unable to figure out the way
ahead.
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Like how to create the augmented matrix A
:=
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
(0|D)
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
which
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
you
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
have
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
mentioned.
On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy
Lyubimov
Post by Trevor Grant
<
been a
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
bit
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
confusing?
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 8:40 AM,
KHATWANI
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
PARTH
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
BHARAT
Post by KHATWANI PARTH BHARAT
<
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the kmeans
clustering
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
algorithm
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
using
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Mahout
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Samsara
Post by KHATWANI PARTH BHARAT
but i am bit confused
about how to leverage Distributed Row
Matrix
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
for
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
same.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Can
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
anybody
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
help
Post by KHATWANI PARTH BHARAT
me with same.
Thanks
Parth Khatwani
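Dmitriy's (counts | sums) trick can be sketched in plain NumPy (an illustrative stand-in for the DRM expressions, not Mahout/Samsara code; the data matrix and key assignments below are made up):

```python
import numpy as np

# Toy data: 4 points in 3 dimensions (rows of A), plus a made-up
# nearest-centroid assignment (the "row key") for each point.
A = np.array([[1., 1., 3.],
              [2., 3., 4.],
              [3., 4., 5.],
              [4., 5., 6.]])
keys = np.array([0, 1, 1, 0])   # cluster index per row
k = 2

# (1|A) appends a column of ones. An aggregating transpose grouped by
# key then yields, per cluster column, a first entry holding the point
# count and the remaining entries holding the coordinate sums.
ones_A = np.hstack([np.ones((A.shape[0], 1)), A])
B = np.zeros((A.shape[1] + 1, k))
for row, key in enumerate(keys):
    B[:, key] += ones_A[row]     # rows sharing a key are summed

counts = B[0, :]                 # first row of B: cluster sizes
sums = B[1:, :]                  # rest of B: per-cluster coordinate sums

# Guard against empty clusters (a 0 count would yield NaNs).
nonempty = counts > 0
C = sums[:, nonempty] / counts[nonempty]   # updated centroids, one per column
```

Dividing each cluster's coordinate sums by its count is exactly the `C <- B(2:,:) / B(1,:)` step, with the empty-cluster caveat handled by filtering.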
KHATWANI PARTH BHARAT
2017-04-22 19:21:39 UTC
Permalink
Thanks Trevor.

I will wait for the bug fix.
Post by Trevor Grant
In short Khatwani, you found a bug!
They creep in from time to time. Thank you and sorry for the
inconvenience.
You'll find
https://issues.apache.org/jira/browse/MAHOUT-1971
and subsequent PR
https://github.com/apache/mahout/pull/307/files
addressing this issue.
Wait for these to close and then try building mahout again with mvn clean
install.
Your code will hopefully work then.
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, Apr 21, 2017 at 11:25 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy
I didn't get this: "The fix is one character long
(+ better test for aggregation)."
And even before the aggregating transpose, I am trying to assign cluster
IDs to the row keys, which doesn't seem to work.
I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this when I assign 1 to each and every row key:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
From what I have understood, even before doing the aggregating transpose
the matrix should be in the former format; only then will rows with the
same key be added.
Correct me if I am wrong.
Thanks
Parth Khatwani
Post by Dmitriy Lyubimov
There appears to be a bug in the Spark transposition operator w.r.t.
aggregating semantics, which appears in cases where the same cluster (key)
is present more than once in the same block. The fix is one character long
(+ better test for aggregation).
On Fri, Apr 21, 2017 at 1:06 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
One is the cluster ID of the index to which the data point should be
assigned.
As per what is given in the book Apache Mahout: Beyond MapReduce
<http://www.amazon.in/Apache-Mahout-Mapreduce-Dmitriy-Lyubimov/dp/1523775785>
in chapter 4 about the aggregating transpose, from what I have understood,
rows having the same key will be added when we take the aggregating
transpose of the matrix.
So I think there should be a way to assign new values to row keys, and I
think Dmitriy has also mentioned the same thing in the approach he has
outlined in this mail chain.
Correct me if I am wrong.
Thanks
Parth Khatwani
On Sat, Apr 22, 2017 at 1:54 AM, Trevor Grant <
Post by Trevor Grant
Got it- in short, no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1.
What are you trying to achieve by creating keys of 1?

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
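Trevor's HashMap analogy can be made concrete with a small plain-Python sketch (a hypothetical illustration of aggregating-key semantics, not Mahout code): rows that share a key are summed into a single bucket, so giving every row the key 1 collapses the whole matrix into one logical row.

```python
from collections import defaultdict

# The thread's example matrix, keyed 0..3 initially.
rows = {
    0: [1.0, 1.0, 1.0, 3.0],
    1: [1.0, 2.0, 3.0, 4.0],
    2: [1.0, 3.0, 4.0, 5.0],
    3: [1.0, 4.0, 5.0, 6.0],
}

# Reassign every row key to 1, as keys(row) = 1 does inside mapBlock.
rekeyed = [(1, vec) for vec in rows.values()]

# Aggregating semantics: rows sharing a key are summed, not kept apart.
agg = defaultdict(lambda: [0.0] * 4)
for key, vec in rekeyed:
    agg[key] = [a + b for a, b in zip(agg[key], vec)]

# Every point ends up in the single bucket for key 1.
print(dict(agg))   # {1: [4.0, 10.0, 13.0, 18.0]}
```

This is why assigning the same key to all rows cannot preserve them as four distinct rows: the key space, like a dictionary, holds one entry per key.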
On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor
I was trying to write "*Kmeans*" using the Mahout DRM as per the algorithm
outlined by Dmitriy.
I was facing the problem of assigning cluster IDs to the row keys.
For example, consider the below matrix, where columns 1 to 3 are the data
points and column 0 contains the count of the point:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now, after calculating the centroid closest to the data point, I am trying
to assign that centroid's index to the *row key*.
Suppose that every data point is assigned to the centroid at index 1, so
after assigning key=1 to each and every row using the code below

val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}

I want the above matrix to be in this form
{
1 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
but it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
I am confused whether assigning the new key values to the row index is
done through the following code line

keys(row) = 1

or whether there is any other way.
I am not able to find any useful links or references on the internet; even
Andrew and Dmitriy's book does not have any proper reference for the
above-mentioned issue.
Thanks & Regards
Parth Khatwani
On Fri, Apr 21, 2017 at 10:06 PM, Trevor Grant <
Post by Trevor Grant
OK, I dug into this before I read your question carefully, that was my bad.

You want
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
to be
{
0 => {1: 5.0} // (not 4.0) // and 6.0 in your example...
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}

val drm2 = (A(::, 1 until 4) cbind 0.0).mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) block(row, 3) = block(row, ::).sum
    (keys, block)
}
val aggTranspose = drm2(::, 3 until 4).t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

Here we are creating an empty column, then filling it with the row sums.
A distributed rowSums fn would be nice for just such an occasion... sigh

Let me know if that gets you going again. That was simpler than I thought-
sorry for the delay on this.

PS
Candidly, I didn't explore further once I understood the question, but if
you are going to collect this to the driver anyway (not sure if that is
the case)

A(::, 1 until 4).rowSums

would also work.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
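Trevor's workaround (cbind a zero column, fill it with the row sums, then slice that column out and transpose) can be mirrored in plain NumPy to check the arithmetic (an illustrative sketch, not Mahout code; the matrix is columns 1..3 of the thread's example):

```python
import numpy as np

# Columns 1..3 of the thread's example matrix (column 0 held counts).
A = np.array([[1., 1., 3.],
              [2., 3., 4.],
              [3., 4., 5.],
              [4., 5., 6.]])

# cbind a zero column, then fill it with each row's sum, mirroring
# (A(::, 1 until 4) cbind 0.0).mapBlock { block(row, 3) = block(row, ::).sum }.
drm2 = np.hstack([A, np.zeros((A.shape[0], 1))])
for row in range(drm2.shape[0]):
    drm2[row, 3] = drm2[row, :].sum()   # col 3 is still 0 here, so this is cols 0..2

# Slicing out that column and transposing gives a 1 x n row of row sums:
# 5, 9, 12, 15 for this data.
agg_transpose = drm2[:, 3:4].T
```

Because the keys are untouched here, no aggregation happens on the transpose; the slice-and-transpose simply reshapes the per-row sums, which is why this sidesteps the key-aggregation bug entirely.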
On Thu, Apr 20, 2017 at 9:01 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Trevor Sir,
I have attached the sample data file, and here is the link to the complete
Data File <https://drive.google.com/open?id=0Bxnnu_Ig2Et9QjZoM3dmY1V5WXM>.
Following is the link for the GitHub branch for the code:
https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov
KmeansMahout.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/KmeansMahout.scala>
is the complete code.
I have also made a sample program just to test assigning new values to the
keys of a row matrix and the aggregating transpose. I think assigning new
values to the keys of the row matrix and the aggregating transpose are
causing the main problem in the Kmeans code.
Following is the link to the GitHub repo for this code:
TestClusterAssign.scala
<https://github.com/parth2691/Spark_Mahout/blob/Dmitriy-Lyubimov/TestClusterAssign.scala>
The above code contains hard-coded data. Following are the expected and
actual outputs of the above code.
The output of the 1st println, "After New Cluster assignment", should be
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centroid count, and columns
1, 2 and 3 contain data), but it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
Thanks Trevor for such great help.
Best Regards
Parth
On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant <
Post by Trevor Grant
Hey
Sorry for the delay- was getting ready to tear into this.
Would you mind posting a small sample of data that you would expect this
application to consume?
tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy, @Trevor and @Andrew Sir,
I am still stuck at the above problem. Can you please help me out with it?
I am unable to find a proper reference to solve the above issue.
Thanks & Regards
Parth Khatwani
On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy,
@Trevor and @Andrew
I have tried testing the row key assignment issue which I mentioned in the
above mail, by writing separate code where I assign a default value of 1
to each row key of the DRM and then take the aggregating transpose.
I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.
The code is as follows:

val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
val A = drmParallelize(m = inCoreA)
// mapBlock
val drm2 = A.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until keys.size) {
      // assigning 1 to each row index
      keys(row) = 1
    }
    (keys, block)
}
println("After New Cluster assignment")
println("" + drm2.collect)
val aggTranspose = drm2.t
println("Result of aggregating transpose")
println("" + aggTranspose.collect)

The output of the 1st println, "After New Cluster assignment", should be
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
(here the zeroth column is used to store the centroid count, and columns
1, 2 and 3 contain data), but it turns out to be this:
{
0 => {}
1 => {0:1.0,1:4.0,2:5.0,3:6.0}
2 => {}
3 => {}
}
And the result of the aggregating transpose should be
{
0 => {1: 4.0}
1 => {1: 9.0}
2 => {1: 12.0}
3 => {1: 15.0}
}
I have referred to the book written by Andrew and Dmitriy, Apache Mahout:
Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>.
The aggregating transpose and other concepts are explained very nicely
there, but I am unable to find any example where row keys are assigned new
values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not
contain any such examples.
It would be great if I could get some reference to a solution of the
mentioned issue.
Thanks
Parth Khatwani
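One way to picture the symptom in the test above (a hypothetical sketch of overwrite-vs-aggregate semantics, not Mahout internals): if rows that share a key overwrite one another instead of being summed, only the last row keyed 1 survives, which matches the observed `{1 => {0:1.0,1:4.0,2:5.0,3:6.0}}` output.

```python
# Made-up data mirroring the thread's example, with every row keyed 1.
rows = [
    (1, [1.0, 1.0, 1.0, 3.0]),
    (1, [1.0, 2.0, 3.0, 4.0]),
    (1, [1.0, 3.0, 4.0, 5.0]),
    (1, [1.0, 4.0, 5.0, 6.0]),
]

# Overwrite semantics: last write wins per key (the buggy-looking result).
overwrite = {}
for key, vec in rows:
    overwrite[key] = vec

# Aggregating semantics: rows sharing a key are summed (the intent).
aggregate = {}
for key, vec in rows:
    prev = aggregate.get(key, [0.0] * 4)
    aggregate[key] = [a + b for a, b in zip(prev, vec)]

print(overwrite)   # {1: [1.0, 4.0, 5.0, 6.0]}
print(aggregate)   # {1: [4.0, 10.0, 13.0, 18.0]}
```

The overwrite result is exactly the reported symptom; the aggregate result is what the aggregating transpose is meant to deliver once MAHOUT-1971 is fixed.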
On Sat, Apr 15, 2017 at 12:13 AM, Andrew Palumbo <
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
Date: 04/14/2017 11:40 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Parth and Dmitriy,

This is awesome- as a follow on can we work on getting this rolled in to the algorithms framework?

Happy to work with you on this Parth!

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, Apr 14, 2017 at 1:27 PM, Dmitriy Lyubimov <
I would think reassigning keys should work in most cases.

The only exception is that technically Spark contracts imply that the effect should be idempotent if a task is retried, which might be a problem in a specific scenario of the object tree coming out of the block cache, which can stay there and be retried again. But specifically w.r.t. this key assignment I don't see any problem, since the action obviously would be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
On Fri, Apr 14, 2017 at 2:26 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir,
In the K means code above I think I am doing the following incorrectly: assigning the closest centroid index to the row keys of the DRM.

//11. Iterating over the Data Matrix (in DrmLike[Int] format) to calculate the initial centriods
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)

      //12. findTheClosestCentriod finds the closest centriod to the data point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

In step 12 I am finding the centroid closest to the current dataPoint.
In step 13 I am assigning the closesetIndex to the key of the corresponding row represented by the dataPoint.
I think I am doing step 13 incorrectly.
Also I am unable to find the proper reference for the same in the reference links which you have mentioned above.

Thanks & Regards
Parth Khatwani
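As a sanity check on steps 12-13, the closest-centroid search itself is just an argmin over Euclidean distances. A minimal NumPy sketch (independent of the Mahout API, mirroring what findTheClosestCentriod is meant to compute):

```python
import numpy as np

def find_closest_centroid(point, centroids):
    """Return the index of the centroid nearest to `point` under
    Euclidean distance (argmin over row-wise distances)."""
    dists = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(dists))

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 2.0], [9.0, 8.0]])
assignments = [find_closest_centroid(p, centroids) for p in points]
# → assignments == [0, 1]
```

The open question in the thread is not this search but whether writing the resulting index into `keys(row)` inside mapBlock is the right way to re-key the DRM.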
On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
Dmitriy Sir,
I have created a github branch having the initial Kmeans code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>

Thanks & Regards
Parth Khatwani
On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo <
+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
Date: 04/12/2017 11:25 (GMT-08:00)
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Can't say I can read this code well formatted that way...

It would seem to me that the code is not using the broadcast variable and instead is using the closure variable. That's the only thing I can immediately see by looking in the middle of it.

It would be better if you created a branch on github for that code; that would allow for easy check-outs and comments.

-d
On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
@Dmitriy Sir,
I have completed the Kmeans code as per the algorithm you have outlined above.
My code is as follows.
This code works fine till step number 10.
In step 11 I am assigning the new centroid index to the corresponding row key of the data point in the matrix.
I think I am doing something wrong in step 11; maybe I am using incorrect syntax.
Can you help me find out what I am doing wrong?
//start of main method
def main(args: Array[String]) {
//1. initialize the spark and mahout
context
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val conf = new SparkConf()
.setAppName("DRMExample")
.setMaster(args(0))
.set("spark.serializer",
"org.apache.spark.serializer.
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
KryoSerializer")
.set("spark.kryo.registrator",
"org.apache.mahout.sparkbindings.io.
MahoutKryoRegistrator")
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
implicit val sc = new
SparkDistributedContext(new
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
SparkContext(conf))
Post by KHATWANI PARTH BHARAT
//2. read the data file and save it in the
rdd
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val lines = sc.textFile(args(1))
//3. convert data read in as string in to
array
Post by KHATWANI PARTH BHARAT
of
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
double
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val test = lines.map(line =>
line.split('\t').map(_.toDoubl
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
e))
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//4. add a column having value 1 in array
of
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
double
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
this
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
will
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
create something like (1 | D)', which will be
used
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
while
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
calculating
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
(1 | D)'
val augumentedArray =
test.map(addCentriodColumn
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
_)
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//5. convert rdd of array of double in rdd
of
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
DenseVector
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val rdd = augumentedArray.map(dvec(_))
//6. convert rdd to DrmRdd
val rddMatrixLike: DrmRdd[Int] =
rdd.zipWithIndex.map
Post by Trevor Grant
{
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
case
Post by KHATWANI PARTH BHARAT
(v,
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
idx) => (idx.toInt, v) } //7. convert
DrmRdd
Post by KHATWANI PARTH BHARAT
to
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
CheckpointedDrm[Int] val matrix =
drmWrap(rddMatrixLike)
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//8.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
seperating the column having all ones created
in
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
step
Post by Trevor Grant
4
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
and
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
will
Post by KHATWANI PARTH BHARAT
use
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
it later val oneVector = matrix(::, 0 until
1)
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
//9.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
final
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
input data in DrmLike[Int] format val
dataDrmX
Post by Dmitriy Lyubimov
=
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
matrix(::,
Post by KHATWANI PARTH BHARAT
1
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
until
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
4) //9. Sampling to select initial
centriods
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
val
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
centriods = drmSampleKRows(dataDrmX, 2, false)
centriods.size
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//10. Broad Casting the initial centriods
val
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
broadCastMatrix
Post by KHATWANI PARTH BHARAT
=
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
drmBroadcast(centriods) //11.
Iterating
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
over
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Data
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Matrix(in DrmLike[Int] format) to calculate
the
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
initial
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
centriods
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
dataDrmX.mapBlock() { case (keys, block)
=>
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
for
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
(row
Post by KHATWANI PARTH BHARAT
<-
Post by Dmitriy Lyubimov
0
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
until block.nrow) { var dataPoint =
block(row,
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
::)
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//12. findTheClosestCentriod find the
closest
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
centriod
Post by KHATWANI PARTH BHARAT
to
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Data point specified by "dataPoint"
val
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
closesetIndex
Post by KHATWANI PARTH BHARAT
=
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
findTheClosestCentriod(dataPoint, centriods)
//13.
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
assigning closest index to key
keys(row)
Post by Dmitriy Lyubimov
=
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
closesetIndex
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
} keys -> block }
//14. Calculating the (1|D) val b =
(oneVector
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
cbind
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
dataDrmX) //15. Aggregating Transpose
(1|D)'
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
val
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
bTranspose
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
= (oneVector cbind dataDrmX).t // after
step
Post by KHATWANI PARTH BHARAT
15
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
bTranspose
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
will
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
have data in the following format
/*(n+1)*K
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
where
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
n=dimension
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
of the data point, K=number of clusters *
zeroth
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
row
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
will
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
contain
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
the count of points assigned to each cluster
*
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
assuming
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
3d
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
data
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
points * */
val nrows = b.nrow.toInt //16. slicing
the
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
count
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
vectors
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
out
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val pointCountVectors = drmBroadcast(b(0
until
Post by KHATWANI PARTH BHARAT
1,
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
::).collect(0,
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
::))
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
val vectorSums = b(1 until nrows, ::)
//17.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
dividing
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
data
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
point by count vector
vectorSums.mapBlock() {
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
case
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
(keys,
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
block) => for (row <- 0 until
block.nrow) {
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
block(row,
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
::) /= pointCountVectors } keys
->
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
block
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
}
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
//18.
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
seperating the count vectors val
newCentriods =
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
vectorSums.t(::,1
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
until centriods.size) //19. iterate
over
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
the
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
above
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
code
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
till convergence criteria is meet }//end of
main
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
method
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
// method to find the closest centriod to
data
Post by KHATWANI PARTH BHARAT
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
point(
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by Trevor Grant
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
Vector
Post by Dmitriy Lyubimov
Post by KHATWANI PARTH BHARAT
Post by KHATWANI PARTH BHARAT
// finds the index of the closest centriod (row of the matrix) to the
// vector passed in the arguments
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(squaredSum)
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}

//calculating the sum of squared distance between the points (Vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}

//method used to create (1|D); the original function name was lost in the
//quoting, so "addCentriodColumn" below is a stand-in
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.size) {
    newArr(i + 1) = arg(i)
  }
  newArr
}
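The distance logic above can be sanity-checked outside Mahout with plain Scala collections. In this sketch, Array[Double] is a hypothetical stand-in for Mahout's Vector and Matrix types, and the square root is skipped since it does not change which centroid is nearest:

```scala
// Plain-Scala analogue of ssr and findTheClosestCentriod above;
// Array[Double] stands in for Mahout's Vector/Matrix types.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def findClosest(point: Array[Double], centroids: Array[Array[Double]]): Int =
  centroids.indices.minBy(i => sqDist(point, centroids(i)))

val cs = Array(Array(0.0, 0.0, 0.0), Array(10.0, 10.0, 10.0))
println(findClosest(Array(1.0, 2.0, 1.0), cs)) // prints 0
```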
Thanks & Regards
Parth Khatwani
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
ps1 this assumes row-wise construction of A based on a training set of m n-dimensional points.
ps2 since we are doing multiple passes over A, it may make sense to make sure it is committed to the spark cache (by using the checkpoint api), if spark is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
here is the outline. For details of the APIs, please refer to the Samsara manual [2]; i will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity, let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

First, classic k-means starts by selecting initial clusters by sampling them out. You can do that by using the sampling api [1], thus forming a k x n in-memory matrix C (the current centroids). C is therefore of Mahout's Matrix type.

You then proceed by alternating between cluster assignments and recomputing the centroid matrix C till convergence, based on some test or simply limited by an epoch count budget, your choice.
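The alternation just described can be sketched as a plain-Scala skeleton (a hypothetical in-memory model only: Seq and Array stand in for the DRM and the distributed steps, and a fixed epoch budget is the stopping rule):

```scala
// In-memory skeleton of the alternating k-means loop: re-key each point
// to its nearest centroid, recompute centroids, repeat for a fixed budget.
def kmeansSketch(points: Seq[Array[Double]],
                 init: Array[Array[Double]],
                 epochs: Int): Array[Array[Double]] = {
  def sq(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  var c = init
  for (_ <- 0 until epochs) {
    // assignment step: key each point by its nearest centroid's index
    val keyed = points.map(p => c.indices.minBy(i => sq(p, c(i))) -> p)
    // update step: average the members of each cluster
    c = c.indices.toArray.map { i =>
      val members = keyed.collect { case (k, p) if k == i => p }
      if (members.isEmpty) c(i) // keep the old centroid for an empty cluster
      else members.transpose.map(col => col.sum / members.size).toArray
    }
  }
  c
}
```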
Cluster assignments: here, we go over the current generation of A and recompute the centroid index for each row in A. Once we recompute an index, we put it into the row key. You can do that by assigning centroid indices to keys of A using the operator mapblock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapblock() closure. Plenty of examples of that are given in [2]. Essentially, in mapblock, you'd reform the row keys to reflect the cluster index in C. While going over A, you'd have a "nearest neighbor" problem to solve for each row of A and the centroids C. This is the bulk of the computation really, and there are a few tricks there that can speed this step up in both exact and approximate manners, but you can start with a naive search.
Once you have assigned centroids to the keys of matrix A, you'd want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by the centroid key. The trick is to do a computation of (1|A)', which will result in a matrix of the shape (counts/sums of cluster rows). This is the part i find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]).
So when you compute, say, B = (1|A)', then B is (n+1) x k, so each column contains a vector corresponding to a cluster 1..k. In such a column, the first element would be the # of points in the cluster, and the rest of it would correspond to the sum of all the points. So in order to arrive at an updated matrix C, we need to collect B into memory and slice out the counters (first row) from the rest of it:

C <- B(2:,:), each row divided by B(1,:)

(watch out for empty clusters with 0 elements; they will cause lack of convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B, for which i am again making reference to [2].
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable, http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again dated but largely viable for this purpose, http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
[4] mapblock etc., http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
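For intuition only, the slice-and-divide step on B = (1|A)' can be modeled with nested arrays in plain Scala (a hypothetical stand-in, not the Mahout API; row 0 of B holds the per-cluster counts and rows 1..n hold the coordinate sums, as described above):

```scala
// b is (n+1) x k: b(0) holds per-cluster point counts, b(1..n) holds
// per-coordinate sums. Dividing sums by counts yields the k x n centroid
// matrix C. A zero count would produce NaN, hence the guard.
def centroidsFromB(b: Array[Array[Double]]): Array[Array[Double]] = {
  val k = b(0).length
  val n = b.length - 1
  Array.tabulate(k, n) { (c, j) =>
    if (b(0)(c) == 0.0) 0.0 // empty cluster: avoid NaN
    else b(j + 1)(c) / b(0)(c)
  }
}

// 2 clusters, 2 coordinates: cluster 0 has 2 points summing to (4,4);
// cluster 1 has 1 point at (10,10).
val b = Array(Array(2.0, 1.0), Array(4.0, 10.0), Array(4.0, 10.0))
centroidsFromB(b) // Array(Array(2.0, 2.0), Array(10.0, 10.0))
```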
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy can you please again tell me the approach to move ahead.
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
yes i am unable to figure out the way ahead.
Like how to create the augmented matrix A := (0|D) which you have mentioned.
On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
was my reply for your post on @user a bit confusing?
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the kmeans clustering algorithm using Mahout Samsara
but am a bit confused about how to leverage the Distributed Row Matrix for
the same. Can anybody help me with the same.
Thanks
Parth Khatwani
KHATWANI PARTH BHARAT
2017-04-25 02:34:02 UTC
Permalink
@Trevor and @Dmitriy

The tough bug in the aggregating transpose is fixed. One issue is still left which is causing a hindrance in completing the KMeans code: assigning the row keys of the DRM with the "closest cluster index" found.

Consider the matrix of data points given as follows:
{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now these are
0 =>
1 =>
2 =>
3 =>
the row keys. Here the zeroth column (0) contains the values that will be used to store the count of points assigned to each cluster, and columns 1 to 3 contain the coordinates of the data points.

So now, after the cluster assignment step of the k-means algorithm which @Dmitriy has outlined at the beginning of this mail chain, the above matrix should look like this (assuming that the 0th and 1st data points are assigned to the cluster with index 0, and the 2nd and 3rd data points are assigned to the cluster with index 1):

{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
0 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

To achieve the above-mentioned result I am using the following lines of code:

//11. Iterating over the data matrix (in DrmLike[Int] format)
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      var dataPoint = block(row, ::)

      //12. findTheClosestCentriod finds the closest centriod to the data
      //    point specified by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      //13. assigning closest index to key
      keys(row) = closesetIndex
    }
    keys -> block
}

But it turns out to be:

{
0 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

So, is there anything wrong with the syntax of the above code? I am unable to find any reference to the way in which I should assign a value to the row keys.

@Trevor, as per what you have mentioned in the above mail chain:
"Got it- in short no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1."

But according to the algorithm @Dmitriy outlined at the start of the mail chain, assigning the same key to multiple rows should be possible. The same is also mentioned in the book written by Dmitriy and Andrew: the rows having the same row keys are summed up when we take the aggregating transpose.
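The summing behavior in question can be mimicked in a few lines of plain Scala to check the expectation (a hypothetical model of how duplicate row keys collapse under the aggregating transpose, not the Mahout API itself):

```scala
// Rows sharing a key are summed coordinate-wise, which is what the
// aggregating transpose does with duplicate row keys.
def sumByKey(rows: Seq[(Int, Array[Double])]): Map[Int, List[Double]] =
  rows.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map(_._2)
           .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
           .toList
  }

val rows = Seq(0 -> Array(1.0, 1.0), 0 -> Array(2.0, 3.0), 1 -> Array(4.0, 5.0))
sumByKey(rows) // Map(0 -> List(3.0, 4.0), 1 -> List(4.0, 5.0))
```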

I am now confused about whether what I have mentioned above is possible to achieve, or it is not possible, or it is a bug in the API.
Thanks & Regards
Parth
Khurrum Nasim
2017-04-25 15:37:13 UTC
Permalink
Can Mahout be used for self-driving tech?

Thanks,

Khurrum.
KHATWANI PARTH BHARAT
2017-04-26 09:04:02 UTC
Permalink
@Trevor and @Dmitriy

Tough Bug in Aggregating Transpose is fixed. One issue is still left which
is causing hindrance in completing the KMeans Code
That issue is of Assigning the the Row Keys of The DRM with the "Closest
Cluster Index" found
Consider the Matrix of Data points given as follows

{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}
Now these are
0 =>
1 =>
2 =>
3 =>
the Row keys. Here Zeroth column(0) contains the values which will be used
the store the count of Points assigned to each cluster and Column 1 to 3
contains co-ordinates of the data points.

Now, after the cluster-assignment step of the KMeans algorithm which @Dmitriy
outlined at the beginning of this mail chain, the matrix above should look
like this (assuming the 0th and 1st data points are assigned to the cluster
with index 0, and the 2nd and 3rd data points to the cluster with index 1):

{
0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
0 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

To achieve the result above, I am using the following lines of code:

//11. Iterating over the data matrix (in DrmLike[Int] format)
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      val dataPoint = block(row, ::)

      //12. findTheClosestCentriod finds the centroid closest to the
      // data point given by "dataPoint"
      val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

      //13. assigning the closest index to the key
      keys(row) = closesetIndex
    }
    keys -> block
}

But it turns out to be

{
0 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}


So is there anything wrong with the syntax of the above code? I am unable to
find any reference on how to assign values to the row keys.
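For what it's worth, here is a sketch of how this step might be written, under
two assumptions (not verified against this Mahout version): that mapBlock is
lazy and returns a new DRM whose result must be captured, and that collecting
an Int-keyed DRM into an in-core matrix uses the keys as row indices, so
duplicate-keyed rows overwrite one another in the printout while still
existing on the distributed side:

```scala
// Hypothetical sketch; dataDrmX, findTheClosestCentriod and centriods are the
// names from the code above. mapBlock does not modify dataDrmX in place: it
// returns a new, lazily evaluated DRM, so the result must be assigned (and
// checkpointed, since the new keys are data-dependent).
val assignedDrm = dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      keys(row) = findTheClosestCentriod(block(row, ::), centriods)
    }
    keys -> block
}.checkpoint()

// Caution: collecting assignedDrm into an in-core matrix indexes rows by key,
// so rows sharing a key collapse in the printed view; the duplicate-keyed
// rows are still present distributed-side.
```

If that second assumption holds, the two-row output would be a display
artifact of collecting, not a loss of rows.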

@Trevor as per what you have mentioned in the above mail chain
"Got it- in short no.

Think of the keys like a dictionary or HashMap.

That's why everything is ending up on row 1."

But according to the algorithm @Dmitriy outlined at the start of the mail
chain, assigning the same key to multiple rows should be possible. The same
is also mentioned in the book written by Dmitriy and Andrew: rows that share
a row key are summed up when we take the aggregating transpose.

I am now confused about whether what I have described above is achievable,
or whether this is a bug in the API.
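To make the aggregating-transpose behaviour concrete, here is a sketch of the
expected arithmetic on the example above (assignedDrm is a hypothetical name
for the re-keyed DRM; semantics as described in the book, not verified here):

```scala
// Re-keyed matrix (duplicate Int keys are legal on the distributed side):
//   0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
//   0 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
//   1 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
//   1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
// The aggregating transpose sums same-keyed rows into a single column each:
val summed = assignedDrm.t
// column 0 (cluster 0): {0: 2.0, 1: 3.0, 2: 4.0, 3: 7.0}
// column 1 (cluster 1): {0: 2.0, 1: 7.0, 2: 9.0, 3: 11.0}
// Row 0 of summed now holds the per-cluster point counts (2.0 and 2.0), so
// dividing the remaining rows of each column by its row-0 entry would give
// the new centroids.
```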



Thanks & Regards
Parth
Post by Khurrum Nasim
Can Mahout be used for self-driving tech?
Thanks,
Khurrum.
Trevor Grant
2017-05-19 20:59:59 UTC
Permalink
Bumping this-

Parth, is there anything we can do to assist you?



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*


KHATWANI PARTH BHARAT
2017-05-20 13:24:09 UTC
Permalink
Hey Trevor,
I have completed the KMeans code and will soon commit it as per the
instructions you shared with me in the other mail chain.


Best Regards
Parth