Discussion:
[jira] [Created] (MAHOUT-1833) One more svec function accepting cardinality as parameter
Edmond Luo (JIRA)
2016-04-19 08:20:25 UTC
Edmond Luo created MAHOUT-1833:
----------------------------------

Summary: One more svec function accepting cardinality as parameter
Key: MAHOUT-1833
URL: https://issues.apache.org/jira/browse/MAHOUT-1833
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.12.0
Environment: Mahout Spark Shell 0.12.0,
Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1,
Centos7 64bit
Reporter: Edmond Luo


It would be nice to add one more wrapper function like the one below to org.apache.mahout.math.scalabindings:

{code}
/**
 * Create a sparse vector out of a list of tuple2's, with a specific cardinality (size).
 * Throws IllegalArgumentException if the cardinality is smaller than the required
 * cardinality of sdata (largest index + 1).
 * @param cardinality the logical size of the vector
 * @param sdata the (index, value) pairs
 * @return a RandomAccessSparseVector of the given cardinality
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must not be smaller than required[$required]!")
  }

  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}
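For illustration, here is a minimal usage sketch of the proposed overload (made-up data; it assumes the function is in scope alongside the existing svec from org.apache.mahout.math.scalabindings):

{code}
// Highest index in sdata is 5, so the required cardinality is 6.
val sdata = Seq(0 -> 1.0, 2 -> 3.0, 5 -> 2.0)

// OK: 10 >= 6, yields a sparse vector of logical size 10.
val v1 = svec(10, sdata)

// Throws IllegalArgumentException: 4 < 6.
val v2 = svec(4, sdata)
{code}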

So users can specify the cardinality of the created sparse vector.

This is very useful and convenient when a user wants to create a DRM from many sparse vectors that do not share the same actual size but do share the same logical size (e.g. the rows of a sparse matrix).

The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// Element-wise operations like the ones below fail when the DRM's
// sparse vectors do not have a consistent cardinality.
val drm2 = drm + drm
val drm3 = drm - drm
val drm4 = drm * drm
val drm5 = drm / drm
{code}

Notice that in the last map, svec accepts one more parameter, so the cardinality of the created SparseVectors is consistent.
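For contrast, a minimal sketch of the problem (assuming the existing single-argument svec infers the size from the largest index, mirroring the `required` computation above):

{code}
// Existing svec: each row's size is inferred from its own largest index.
val rowA = svec(Seq(0 -> 1.0, 3 -> 1.0))   // size 4
val rowB = svec(Seq(0 -> 1.0, 19 -> 1.0))  // size 20
// rowA and rowB disagree on cardinality, so a DRM built from them has no
// consistent column count and element-wise operations fail.

// Proposed overload: both rows share the same logical size.
val rowA2 = svec(20, Seq(0 -> 1.0, 3 -> 1.0))   // size 20
val rowB2 = svec(20, Seq(0 -> 1.0, 19 -> 1.0))  // size 20
{code}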




Edmond Luo (JIRA)
2016-04-19 08:24:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247368#comment-15247368 ]

Edmond Luo commented on MAHOUT-1833:
------------------------------------

I have implemented all of the above code and added some tests for the new function. However, I am not sure whether I should also add DRM test cases using sparse vectors; it seems we currently have no test case for DRMs built from sparse vectors.
Edmond Luo (JIRA)
2016-04-19 08:28:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247368#comment-15247368 ]

Edmond Luo edited comment on MAHOUT-1833 at 4/19/16 8:27 AM:
-------------------------------------------------------------

I have implemented the new wrapper function as shown above and added some tests for it. However, I am not sure whether I should also add DRM test cases using sparse vectors; it seems we currently have no test case for DRMs built from sparse vectors.
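Purely as a sketch of what such a test might look like (the helper names here, e.g. drmParallelize and SparseRowMatrix, are taken from Mahout's math-scala DRM API as I understand it, and the test scaffolding is assumed, not actual project code):

{code}
// Build a small in-core matrix from cardinality-consistent sparse rows,
// wrap it as a DRM, and check that an element-wise op succeeds.
val rows = Seq(
  svec(20, Seq(0 -> 1.0, 3 -> 2.0)),
  svec(20, Seq(5 -> 1.0, 19 -> 4.0))
)
val inCore = new SparseRowMatrix(rows.size, 20)
rows.zipWithIndex.foreach { case (v, i) => inCore.assignRow(i, v) }

// Implicit DistributedContext assumed from the enclosing test suite.
val drm = drmParallelize(inCore, numPartitions = 2)
val sum = (drm + drm).collect
assert(sum(0, 3) == 4.0)
{code}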


Edmond Luo (JIRA)
2016-04-19 08:29:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Description:
It would be nice to add one more wrapper function like the one below to org.apache.mahout.math.scalabindings:

{code}
/**
 * Create a sparse vector out of a list of tuple2's, with a specific cardinality (size).
 * Throws IllegalArgumentException if the cardinality is smaller than the required
 * cardinality of sdata (largest index + 1).
 * @param cardinality the logical size of the vector
 * @param sdata the (index, value) pairs
 * @return a RandomAccessSparseVector of the given cardinality
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must not be smaller than required[$required]!")
  }

  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}

So users can specify the cardinality of the created sparse vector.

This is very useful and convenient when a user wants to create a DRM from many sparse vectors that do not share the same actual size but do share the same logical size (e.g. the rows of a sparse matrix).

The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// Element-wise operations like the ones below fail when the DRM's
// sparse vectors do not have a consistent cardinality.
val drm2 = drm + drm
val drm3 = drm - drm
val drm4 = drm * drm
val drm5 = drm / drm
{code}

Notice that in the last map, svec accepts one more cardinality parameter, so the cardinality of the created SparseVectors can be consistent.


Suneel Marthi
2016-04-19 10:19:28 UTC
Would you like to make a PR that can be reviewed?

Sent from my iPhone
Suneel Marthi (JIRA)
2016-04-19 11:31:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247585#comment-15247585 ]

Suneel Marthi commented on MAHOUT-1833:
---------------------------------------

Thanks Edmond, please create a PR. It's easier to comment on and review a PR.
ASF GitHub Bot (JIRA)
2016-04-20 09:25:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249549#comment-15249549 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

GitHub user resec opened a pull request:

https://github.com/apache/mahout/pull/224

MAHOUT-1833 - Enhance svec function to accept cardinality as parameter

### What is this PR for?
Enhance the existing svec function to accept cardinality as a parameter (with a default value defined), so users can specify the size of the created vector.

### What type of PR is it?
[Improvement]

### Todos
* [x] - Add the cardinality parameter to svec with default value defined
* [x] - Add test case to MathSuite
* [ ] - Update any doc if needed (pending check)

### What is the Jira issue?
* https://issues.apache.org/jira/browse/MAHOUT-1833

### How should this be tested?
1. Clone the code locally
2. Run the Maven build and tests; all tests should pass

### Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need documentation? Pending check

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/resec/mahout new_svec

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #224

----
commit c28ad3eed7329b0da7837f1102c8c1f8fba021f8
Author: yougoer <***@yougoer.com>
Date: 2016-04-20T08:56:11Z

[MAHOUT-1833] add one more param cardinality with default value -1 and corresponding test cases

----
Edmond Luo (JIRA)
2016-04-20 09:30:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Summary: Enhance svec function to accepting cardinality as parameter (was: One more svec function accepting cardinality as parameter )
ASF GitHub Bot (JIRA)
2016-04-20 15:45:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250115#comment-15250115 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user resec commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-212483702

`[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project mahout-mr: ExecutionException: java.lang.RuntimeException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/travis/build/apache/mahout/mr && /usr/lib/jvm/java-7-oracle/jre/bin/java -Djava.security.auth.login.config=/home/travis/build/apache/mahout/mr/target/../../buildtools/src/test/resources/jaas.config -jar /home/travis/build/apache/mahout/mr/target/surefire/surefirebooter4383272596373359636.jar /home/travis/build/apache/mahout/mr/target/surefire/surefire2790607970252828964tmp /home/travis/build/apache/mahout/mr/target/surefire/surefire_1681133930090970673085tmp`

The Travis CI build failed for some strange reason here; any help?
ASF GitHub Bot (JIRA)
2016-04-20 15:47:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250119#comment-15250119 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-212484167

Ignore the Travis failures, it's still WIP. Thanks for the PR, we'll merge it soon for the next release.
ASF GitHub Bot (JIRA)
2016-04-20 16:21:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250180#comment-15250180 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-212498892

conceptually +1
ASF GitHub Bot (JIRA)
2016-04-20 17:08:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250273#comment-15250273 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user resec commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-212517717

I cannot find any doc to update (but isn't that a bad thing?), so I am assuming there is no doc for this PR to update; please do let me know if I missed anything.
ASF GitHub Bot (JIRA)
2016-04-20 17:21:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250307#comment-15250307 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-212522188

@resec there's an authoritative, sort of, original doc in this branch: https://github.com/apache/mahout/tree/gh-pages/doc

You may take it and submit a PR against it (it requires a LyX/LaTeX editor, assuming you are on Ubuntu).

The good thing is that you can take it and modify it through the usual pull request process.

The bad things about doing it that way are, first, that it is sort of an authored document I originally contributed from elsewhere (I can remove any attributions if it really becomes community-maintained).

But the real reason is, secondly, that it is currently being migrated to the ASF CMS, and it needs to be changed here: http://mahout.apache.org/users/environment/in-core-reference.html, which requires Apache committer status.

So you would really need to change the ASF CMS page, and if you don't have privileges to do that, somebody would have to do it on your behalf after this PR is in.
Edmond Luo (JIRA)
2016-04-21 01:05:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Description:
It would be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings as below:

{code}
/**
 * Create a sparse vector out of a list of tuple2's, with a specific cardinality (size).
 * Throws IllegalArgumentException if the cardinality is smaller than the required
 * cardinality of sdata (largest index + 1).
 * @param cardinality the logical size of the vector
 * @param sdata the (index, value) pairs
 * @return a RandomAccessSparseVector of the given cardinality
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must not be smaller than required[$required]!")
  }

  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}

So users can specify the cardinality of the created sparse vector.

This is very useful and convenient when a user wants to create a DRM from many sparse vectors that do not share the same actual size but do share the same logical size (e.g. the rows of a sparse matrix).

The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// Element-wise operations like the ones below fail when the DRM's
// sparse vectors do not have a consistent cardinality.
val drm2 = drm + drm.t
val drm3 = drm - drm.t
val drm4 = drm * drm.t
val drm5 = drm / drm.t
{code}

Notice that in the last map, svec accepts one more cardinality parameter, so the cardinality of the created SparseVectors can be consistent.


Edmond Luo (JIRA)
2016-04-21 01:08:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Summary: Enhance svec function to accept cardinality as parameter (was: Enhance svec function to accepting cardinality as parameter )
Edmond Luo (JIRA)
2016-04-21 01:08:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Description:
It would be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings:

{code}
/**
 * Create a sparse vector out of a list of tuple2's, with a specific cardinality (size).
 * Throws IllegalArgumentException if the cardinality is smaller than the required
 * cardinality of sdata (largest index + 1).
 * @param cardinality the logical size of the vector
 * @param sdata the (index, value) pairs
 * @return a RandomAccessSparseVector of the given cardinality
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must not be smaller than required[$required]!")
  }

  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}

So users can specify the cardinality of the created sparse vector.

This is very useful and convenient when a user wants to create a DRM from many sparse vectors that do not share the same actual size but do share the same logical size (e.g. the rows of a sparse matrix).

The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// Element-wise operations like the ones below fail when the DRM's
// sparse vectors do not have a consistent cardinality.
val drm2 = drm + drm.t
val drm3 = drm - drm.t
val drm4 = drm * drm.t
val drm5 = drm / drm.t
{code}

Notice that in the last map, svec accepts one more cardinality parameter, so the cardinality of the created SparseVectors can be consistent.


Edmond Luo (JIRA)
2016-04-21 01:13:25 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Description:
It would be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings:

{code}
/**
 * Create a sparse vector out of a list of tuple2's.
 * @param sdata the (index, value) pairs
 * @param cardinality the logical size of the vector; -1 (the default) means infer it from sdata
 * @return a RandomAccessSparseVector
 */
def svec(sdata: TraversableOnce[(Int, AnyVal)], cardinality: Int = -1) = {
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  var tmp = -1
  if (cardinality < 0) {
    tmp = required
  } else if (cardinality < required) {
    throw new IllegalArgumentException(s"Required cardinality $required but got $cardinality")
  } else {
    tmp = cardinality
  }
  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(tmp, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}

So users can specify the cardinality of the created sparse vector.

This is very useful and convenient when a user wants to create a DRM from many sparse vectors that do not share the same actual size but do share the same logical size (e.g. the rows of a sparse matrix).

The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(row._2, cardinality)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// All of the element-wise operations below fail when the DRM's
// sparse vectors do not have a consistent cardinality.
val drm2 = drm + drm.t
val drm3 = drm - drm.t
val drm4 = drm * drm.t
val drm5 = drm / drm.t
{code}

Notice that in the last map, svec accepts one more cardinality parameter, so the cardinality of the created sparse vectors can be consistent.
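As a quick sketch of the default-parameter behaviour (made-up data; assumes the signature shown above):

{code}
val sdata = Seq(1 -> 1.0, 7 -> 2.0)

val v1 = svec(sdata)                    // cardinality inferred: max index + 1 = 8
val v2 = svec(sdata, cardinality = 20)  // explicit logical size 20
// svec(sdata, cardinality = 5) would throw IllegalArgumentException (5 < 8)
{code}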


was:
It will be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings

{code}
/**
* create a sparse vector out of list of tuple2's with specific cardinality(size),
* throws IllegalArgumentException if cardinality is not bigger than required cardinality of sdata
* @param cardinality sdata
* @return
*/
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
if (cardinality < required) {
throw new IllegalArgumentException(s"Cardinality[%cardinality] must be bigger than required[%required]!")
}

val initialCapacity = sdata.size
val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
sv
}
{code}

So the user can specify the cardinality of the created sparse vector.

This is very useful and convenient when a user wants to create a DRM from many sparse vectors whose actual sizes differ but whose logical size is the same (e.g. rows of a sparse matrix).

The code below demonstrates the case:
{code}
var cardinality = 20
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// All element-wise operations below will fail on a DRM built from SparseVectors
// with inconsistent cardinality
val drm2 = drm + drm.t
val drm3 = drm - drm.t
val drm4 = drm * drm.t
val drm5 = drm / drm.t
{code}

Notice that in the last map, svec accepts one more cardinality parameter, so the cardinality of the created SparseVectors is consistent.
ASF GitHub Bot (JIRA)
2016-04-21 01:44:25 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251090#comment-15251090 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user resec commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-212690207

@dlyubimov, thanks for the great detailed explanation.

I don't think I have privileges to edit the [In Core Reference Page](http://mahout.apache.org/users/environment/in-core-reference.html), so I guess somebody else may need to help.

As for the authoritative doc in the other branch, I think I can help; I will submit another PR accordingly soon.
ASF GitHub Bot (JIRA)
2016-04-22 18:04:13 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254341#comment-15254341 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-213532181

@resec, as far as updating the "In Core Reference Page": if it's a short addition, it may be easiest to just add the text to your JIRA (or here on the PR), so that one of us can make the addition.

This looks good to me; +1 to commit this.
ASF GitHub Bot (JIRA)
2016-04-22 18:09:13 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254347#comment-15254347 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-213535411
ASF GitHub Bot (JIRA)
2016-04-22 18:15:13 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254359#comment-15254359 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-213538075

I am +1, contingent on all tests working.
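
For reference, a minimal sketch (hypothetical test class and assertions, assuming plain ScalaTest rather than Mahout's own test harness) of the kind of check that would pin the new behavior down:

```
import org.scalatest.{FunSuite, Matchers}
import org.apache.mahout.math.scalabindings._

class SvecCardinalitySuite extends FunSuite with Matchers {

  test("svec honors an explicit cardinality") {
    // Max index is 10, but the vector should take the requested size 20.
    svec((5 -> 1.0) :: (10 -> 2.0) :: Nil, cardinality = 20).size() shouldBe 20
  }

  test("svec infers cardinality from the data by default") {
    // Max index 10 implies a required cardinality of 11.
    svec((5 -> 1.0) :: (10 -> 2.0) :: Nil).size() shouldBe 11
  }

  test("svec rejects a cardinality smaller than required") {
    an[IllegalArgumentException] should be thrownBy {
      svec((5 -> 1.0) :: (10 -> 2.0) :: Nil, cardinality = 7)
    }
  }
}
```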
ASF GitHub Bot (JIRA)
2016-04-22 18:26:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254393#comment-15254393 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-213541663

Thanks. The PR was opened while we were configuring Travis. I'll test it locally and commit it later.
ASF GitHub Bot (JIRA)
2016-04-22 20:42:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254636#comment-15254636 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/224
Andrew Palumbo (JIRA)
2016-04-22 20:57:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1833:
-----------------------------------
Assignee: Edmond Luo
Suneel Marthi (JIRA)
2016-04-22 21:00:15 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1833:
----------------------------------
Fix Version/s: 0.12.1
Suneel Marthi (JIRA)
2016-04-22 21:00:15 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1833:
----------------------------------
Component/s: Math
Andrew Palumbo (JIRA)
2016-04-22 21:01:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254675#comment-15254675 ]

Andrew Palumbo commented on MAHOUT-1833:
----------------------------------------

Thanks for the contribution, [~resec]. I've committed it to master. I'll leave this open for you to post the documentation changes to the [In-Core Algebra](http://mahout.apache.org/users/environment/in-core-reference.html) page.
Andrew Palumbo (JIRA)
2016-04-22 21:02:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254675#comment-15254675 ]

Andrew Palumbo edited comment on MAHOUT-1833 at 4/22/16 9:01 PM:
-----------------------------------------------------------------

Thanks for the contribution, [~resec]. I've committed it to master. I'll leave this open for you to post the documentation changes to the In-Core Algebra page.




was (Author: andrew_palumbo):
Thanks for the contribution, [~resec]. I've committed it to master. I'll leave this open for you to post the documentation changes to the (In-Core Algebra)[http://mahout.apache.org/users/environment/in-core-reference.html] page.
ASF GitHub Bot (JIRA)
2016-04-25 01:59:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255784#comment-15255784 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user resec commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-214092359

@andrewpalumbo

Thanks for your help, and sorry for the late response; it has been a busy few days.

Let me provide the doc amendment here.

In [Mahout-Samsara's In-Core Linear Algebra DSL Reference](http://mahout.apache.org/users/environment/in-core-reference.html), in the section `[Inline initialization]` -> `[Sparse vectors]`, we can add the following at the end:

```
// Create a sparse vector with a specific cardinality
val sparseVec1 = svec((5 -> 1.0) :: (10 -> 2.0) :: Nil, cardinality = 20)
```

In [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf), in section 1.2 `[Inline initialization]`, we can add the same snippet as shown above.
ASF GitHub Bot (JIRA)
2016-04-26 19:50:13 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258787#comment-15258787 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/224#issuecomment-214865235

No problem; thanks for the contribution, @resec! I've updated http://mahout.apache.org/users/environment/in-core-reference.html to show this feature.

Please note that http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf is a different document from the one @dlyubimov pointed out above, and I cannot edit it.

I believe that you meant this doc: https://github.com/apache/mahout/tree/gh-pages/doc

I will update that when I get a chance, if you have not already made a PR against it.

Thanks.
Andrew Palumbo (JIRA)
2016-04-26 20:17:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo resolved MAHOUT-1833.
------------------------------------
Resolution: Implemented

Committed to master. Thanks!
