Edmond Luo (JIRA)
2016-04-19 08:20:25 UTC
Edmond Luo created MAHOUT-1833:
----------------------------------
Summary: One more svec function accepting cardinality as parameter
Key: MAHOUT-1833
URL: https://issues.apache.org/jira/browse/MAHOUT-1833
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.12.0
Environment: Mahout Spark Shell 0.12.0,
Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1,
Centos7 64bit
Reporter: Edmond Luo
It would be nice to add one more wrapper function, like the one below, to org.apache.mahout.math.scalabindings:
{code}
/**
 * Create a sparse vector out of a list of Tuple2's with a specific cardinality (size).
 * Throws IllegalArgumentException if cardinality is smaller than the cardinality required by sdata.
 * @param cardinality the size of the resulting vector
 * @param sdata (index, value) pairs used to populate the vector
 * @return the populated sparse vector
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  // Materialize once: a TraversableOnce (e.g. an Iterator) may only be traversable a single time
  val data = sdata.toIndexedSeq
  val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must be at least required[$required]!")
  }
  val sv = new RandomAccessSparseVector(cardinality, data.size)
  data.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}
This lets the user specify the cardinality of the created sparse vector.
It is useful when building a DRM from many sparse vectors whose actual sizes differ (the highest populated index varies from vector to vector) but whose logical size is the same, e.g. the rows of a sparse matrix.
The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))
val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))
// All element-wise operations fail on a DRM whose SparseVectors do not have a consistent cardinality
val drm2 = drm + drm
val drm3 = drm - drm
val drm4 = drm * drm
val drm5 = drm / drm
{code}
Note that in the last map, svec takes the additional cardinality parameter, so all of the created SparseVectors have a consistent cardinality.
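To illustrate why the sizes diverge when the cardinality is inferred from the data, here is a minimal plain-Scala sketch that does not depend on Mahout. The object `CardinalityDemo` and the helpers `inferredCardinality` and `checkedCardinality` are hypothetical names used only for illustration; they mirror the `required` computation and the size check in the proposed function, and are not part of any Mahout API.

```scala
object CardinalityDemo {
  // Hypothetical helper: the size a vector gets when it is inferred
  // from the highest populated index (same formula as `required` above).
  def inferredCardinality(sdata: Seq[(Int, Double)]): Int =
    if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0

  // Hypothetical helper: the check the proposed svec performs before
  // building the vector. `require` throws IllegalArgumentException on failure.
  def checkedCardinality(cardinality: Int, sdata: Seq[(Int, Double)]): Int = {
    val required = inferredCardinality(sdata)
    require(cardinality >= required,
      s"Cardinality[$cardinality] must be at least required[$required]!")
    cardinality
  }

  def main(args: Array[String]): Unit = {
    val rowA = Seq((0, 1.0), (2, 1.0)) // highest populated index is 2
    val rowB = Seq((1, 1.0), (7, 1.0)) // highest populated index is 7

    // Inferred sizes differ from row to row.
    println(inferredCardinality(rowA)) // 3
    println(inferredCardinality(rowB)) // 8

    // With an explicit cardinality, every row gets the same size.
    println(checkedCardinality(20, rowA)) // 20
    println(checkedCardinality(20, rowB)) // 20
  }
}
```

Two rows of the same logical matrix end up with sizes 3 and 8 when the size is inferred, which is the inconsistency the explicit cardinality parameter is meant to avoid.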
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)