Discussion:
[jira] [Created] (MAHOUT-1888) Performance Bug with Mahout Vector Serialization
Suneel Marthi (JIRA)
2016-10-09 23:22:20 UTC
Suneel Marthi created MAHOUT-1888:
-------------------------------------

Summary: Performance Bug with Mahout Vector Serialization
Key: MAHOUT-1888
URL: https://issues.apache.org/jira/browse/MAHOUT-1888
Project: Mahout
Issue Type: Bug
Components: spark
Affects Versions: 0.12.2
Reporter: Suneel Marthi
Fix For: 0.13.0


Identified a performance bug with Mahout Vector serialization in DistributedSparkSuite.

Add the following to the suite's Spark configuration:

{Code}
.set("spark.kryo.registrationRequired", "true")
{Code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Suneel Marthi (JIRA)
2016-10-09 23:22:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi reassigned MAHOUT-1888:
-------------------------------------

Assignee: Suneel Marthi
ASF GitHub Bot (JIRA)
2016-10-10 00:17:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15560885#comment-15560885 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

GitHub user smarthi opened a pull request:

https://github.com/apache/mahout/pull/260

MAHOUT-1888: [WIP] Performance Bug with Mahout Vector Serialization



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smarthi/mahout MAHOUT-1888

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #260

----
commit 839f404cee25290c43b4cdeaadeb7c3ec865b448
Author: smarthi <***@apache.org>
Date: 2016-10-10T00:16:26Z

MAHOUT-1888: [WIP] Performance Bug with Mahout Vector Serialization

----
ASF GitHub Bot (JIRA)
2016-10-10 00:17:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15560886#comment-15560886 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user smarthi commented on the issue:

https://github.com/apache/mahout/pull/260

Do not merge, this is still WIP.
ASF GitHub Bot (JIRA)
2016-10-12 03:56:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567471#comment-15567471 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/260#discussion_r82932511

--- Diff: spark/src/test/scala/org/apache/mahout/sparkbindings/test/DistributedSparkSuite.scala ---
@@ -45,6 +45,7 @@ trait DistributedSparkSuite extends DistributedMahoutSuite with LoggerConfigurat
.set("spark.akka.frameSize", "30")
.set("spark.default.parallelism", "10")
.set("spark.executor.memory", "2G")
+ .set("spark.kryo.registrationRequired", "true")
--- End diff --

This is not needed, and it is why it is failing. We can enable it to see what else is left, but we don't have to patch every class out there that is used in tests. I think the classes that still have that problem have something to do with IndexedDataSet, which (in my view) is not part of the algebra engine, so we can ignore the rest.
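
For readers unfamiliar with the setting under discussion, here is a minimal sketch of a Spark configuration that turns on strict Kryo registration. With `spark.kryo.registrationRequired` set to `true`, Kryo throws an exception when it is asked to serialize a class that has not been registered, instead of silently writing the full class name with each object. The object name, master URL, and application name are hypothetical, and the registrator class name is an assumption based on the `MahoutKryoRegistrator.scala` path mentioned in this thread's build notification; this is not the suite's actual configuration.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Hypothetical configuration for illustration only.
  val conf: SparkConf = new SparkConf()
    .setMaster("local[2]")                 // assumed local test master
    .setAppName("kryo-registration-check") // hypothetical application name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Assumed fully qualified name of Mahout's registrator.
    .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
    // Diagnostic switch: fail on any class that was not registered with Kryo.
    .set("spark.kryo.registrationRequired", "true")
}
```

Used this way, the flag works as a diagnostic: a test run fails on the first unregistered class, which can then be registered or, as suggested above, ignored if it lies outside the algebra engine.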
ASF GitHub Bot (JIRA)
2016-10-12 03:56:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567472#comment-15567472 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/260#discussion_r82932383

--- Diff: spark/src/test/scala/org/apache/mahout/sparkbindings/blas/BlasSuite.scala ---
@@ -39,11 +42,11 @@ class BlasSuite extends FunSuite with DistributedSparkSuite {
val drmA = drmParallelize(m = inCoreA, numPartitions = 3)
val drmB = drmParallelize(m = inCoreB, numPartitions = 2)

- val op = new OpABt(drmA, drmB)
+ val op = OpABt(drmA, drmB)

val drm = new CheckpointedDrmSpark(ABt.abt(op, srcA = drmA, srcB = drmB), op.nrow, op.ncol)

- printf("AB' num partitions = %d.\n", drm.rdd.partitions.size)
+ printf("AB' num partitions = %d.\n", drm.rdd.partitions.length)
--- End diff --

In Scala, we use size() to measure collection cardinality. Using .length may cause conversion to a Java collection.
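
For context, `rdd.partitions` in Spark is a plain `Array[Partition]`. The following is a minimal sketch, using a hypothetical array of strings as a stand-in for the partitions, of how the two accessors behave on arrays. It suggests that both return the same count: `length` is the member defined on `Array` itself, while `size` reaches arrays through Scala's implicit array enrichment. All names in the block are illustrative and not taken from the Mahout code base.

```scala
object PartitionCount {
  // Hypothetical stand-in for `drm.rdd.partitions`, which is an Array in Spark.
  val partitions: Array[String] = Array("p0", "p1", "p2")

  // `length` is a member of Array itself.
  val viaLength: Int = partitions.length

  // `size` is not defined on Array; it resolves through the implicit ArrayOps enrichment.
  val viaSize: Int = partitions.size

  def main(args: Array[String]): Unit = {
    // Same format string as in the diff above.
    printf("AB' num partitions = %d.\n", viaLength)
    // Both accessors report the same cardinality.
    assert(viaLength == viaSize)
  }
}
```

For genuine Scala collections such as `Seq` or `Map`, `size` remains the conventional choice; the distinction sketched here applies only to arrays.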
ASF GitHub Bot (JIRA)
2016-10-12 10:27:21 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568330#comment-15568330 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user smarthi commented on a diff in the pull request:

https://github.com/apache/mahout/pull/260#discussion_r82977426

--- Diff: spark/src/test/scala/org/apache/mahout/sparkbindings/test/DistributedSparkSuite.scala ---
@@ -45,6 +45,7 @@ trait DistributedSparkSuite extends DistributedMahoutSuite with LoggerConfigurat
.set("spark.akka.frameSize", "30")
.set("spark.default.parallelism", "10")
.set("spark.executor.memory", "2G")
+ .set("spark.kryo.registrationRequired", "true")
--- End diff --

Fixed the comments.
Suneel Marthi (JIRA)
2016-10-12 14:43:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-1888 started by Suneel Marthi.
---------------------------------------------
ASF GitHub Bot (JIRA)
2016-10-14 01:51:21 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573818#comment-15573818 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/260#discussion_r83346943

--- Diff: spark/src/test/scala/org/apache/mahout/sparkbindings/test/DistributedSparkSuite.scala ---
@@ -45,6 +45,7 @@ trait DistributedSparkSuite extends DistributedMahoutSuite with LoggerConfigurat
.set("spark.akka.frameSize", "30")
.set("spark.default.parallelism", "10")
.set("spark.executor.memory", "2G")
+ .set("spark.kryo.registrationRequired", "true")
--- End diff --

Please remove line 48.
ASF GitHub Bot (JIRA)
2016-10-14 02:21:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573883#comment-15573883 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user smarthi commented on a diff in the pull request:

https://github.com/apache/mahout/pull/260#discussion_r83348819

--- Diff: spark/src/test/scala/org/apache/mahout/sparkbindings/test/DistributedSparkSuite.scala ---
@@ -45,6 +45,7 @@ trait DistributedSparkSuite extends DistributedMahoutSuite with LoggerConfigurat
.set("spark.akka.frameSize", "30")
.set("spark.default.parallelism", "10")
.set("spark.executor.memory", "2G")
+ .set("spark.kryo.registrationRequired", "true")
--- End diff --

It's been removed. The recent GitHub changes for code review retain the original lines that were commented on, which confused me too.
ASF GitHub Bot (JIRA)
2016-10-14 02:35:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573915#comment-15573915 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user smarthi commented on a diff in the pull request:

https://github.com/apache/mahout/pull/260#discussion_r83349655

--- Diff: spark/src/test/scala/org/apache/mahout/sparkbindings/test/DistributedSparkSuite.scala ---
@@ -45,6 +45,7 @@ trait DistributedSparkSuite extends DistributedMahoutSuite with LoggerConfigurat
.set("spark.akka.frameSize", "30")
.set("spark.default.parallelism", "10")
.set("spark.executor.memory", "2G")
+ .set("spark.kryo.registrationRequired", "true")
--- End diff --

Merging this to master. Also bumped Spark version to 1.6.2 and Flink to 1.1.3 as part of this PR.
ASF GitHub Bot (JIRA)
2016-10-14 02:40:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573926#comment-15573926 ]

ASF GitHub Bot commented on MAHOUT-1888:
----------------------------------------

Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/260
Suneel Marthi (JIRA)
2016-10-14 02:40:20 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi resolved MAHOUT-1888.
-----------------------------------
Resolution: Fixed
Hudson (JIRA)
2016-10-14 03:19:21 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574019#comment-15574019 ]

Hudson commented on MAHOUT-1888:
--------------------------------

FAILURE: Integrated in Jenkins build Mahout-Quality #3399 (See [https://builds.apache.org/job/Mahout-Quality/3399/])
MAHOUT-1888: Performance Bug with Mahout Vector Serialization, this (smarthi: rev 292b718a633efae5ff1b47772b9dd7bf9f1ca6da)
* (edit) spark/src/main/scala/org/apache/mahout/common/HDFSPathSearch.scala
* (edit) math-scala/src/test/scala/org/apache/mahout/math/scalabindings/RLikeVectorOpsSuite.scala
* (edit) math/src/main/java/org/apache/mahout/math/TransposedMatrixView.java
* (edit) .travis.yml
* (edit) math-scala/src/test/scala/org/apache/mahout/math/scalabindings/MathSuite.scala
* (edit) spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* (edit) spark/src/test/scala/org/apache/mahout/sparkbindings/blas/BlasSuite.scala
* (edit) pom.xml