Discussion:
[jira] [Created] (MAHOUT-1869) Create a runtime performance measuring framework for mahout
Saikat Kanjilal (JIRA)
2016-06-03 04:33:59 UTC
Permalink
Saikat Kanjilal created MAHOUT-1869:
---------------------------------------

Summary: Create a runtime performance measuring framework for mahout
Key: MAHOUT-1869
URL: https://issues.apache.org/jira/browse/MAHOUT-1869
Project: Mahout
Issue Type: Story
Components: build, Classification, Collaborative Filtering, Math
Affects Versions: 1.0.0
Reporter: Saikat Kanjilal
Fix For: 1.0.0


This proposal outlines a runtime performance module for measuring the performance of various Mahout algorithms in three major areas: clustering, regression and classification. The module will be a spray/scala/akka application that can be run against any current or new algorithm in Mahout and will produce a CSV file and a set of Zeppelin plots outlining the various performance criteria. The goal is that releasing any new Mahout build will involve running a set of tests for each of the algorithms, so that benchmarks can be compared from one release to another.


Architecture
The runtime performance application will run on top of spray/scala and akka and will make async API calls into the various Mahout algorithms to generate a CSV file containing the runtime performance measurements for each algorithm of interest, as well as a set of Zeppelin plots displaying some of these results. The spray/scala architecture will leverage the Zeppelin server to create the visualizations. The discussion below centers around two types of algorithms to be addressed by the application.


Clustering
The application will consist of a set of REST APIs that do the following:


a) A method to load and execute the runtime perf module. It takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing datasets of various sizes, and a set of values for the number of clusters to use for each of those dataset sizes:


/algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40


The above API call will return a runId which the client program can then use to monitor the run.
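As a sketch of this submit/monitor hand-off (all names here are hypothetical, not the actual module's API), the service could keep an in-memory registry keyed by runId:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the submit/monitor hand-off; names are
// illustrative, not the real module's API.
public class RunRegistry {
    public enum RunStatus { PENDING, RUNNING, COMPLETED }

    private static final Map<String, RunStatus> RUNS = new ConcurrentHashMap<>();

    // Register a clustering run and hand back an opaque runId
    // that the client later uses for monitoring.
    public static String submit(String algorithm, String fileLocation, int[] clusterCounts) {
        String runId = UUID.randomUUID().toString();
        RUNS.put(runId, RunStatus.PENDING);
        return runId;
    }

    // What /monitor/runId=... would consult.
    public static RunStatus status(String runId) {
        return RUNS.get(runId);
    }
}
```

A spray route handling the clustering URL above would simply call submit(...) and complete the request with the returned runId; the monitor endpoint would consult status(...).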




b) A method to monitor the application to ensure that it's making progress towards generating the Zeppelin plots:
/monitor/runId=456




The above method will execute asynchronously by calling into the Mahout kmeans (fuzzy kmeans) clustering implementations and will generate Zeppelin plots showing the normalized time on the y axis and the number of clusters on the x axis. The spray/scala/akka framework will allow the client application to receive a callback when the runtime performance calculations are actually completed. For now, the runtime performance measurements will comprise: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be represented in separate Zeppelin plots.
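The two proposed measurements are straightforward to compute; a minimal illustration (the helper names are ours, not Mahout's):

```java
import java.util.Locale;

// Illustrative helpers for the two proposed clustering measurements.
public class ClusteringMetrics {

    // Metric a): fraction of points assigned to their true cluster.
    public static double accuracy(int[] assigned, int[] truth) {
        if (assigned.length != truth.length || truth.length == 0)
            throw new IllegalArgumentException("label arrays must be non-empty and equal length");
        int correct = 0;
        for (int i = 0; i < truth.length; i++)
            if (assigned[i] == truth[i]) correct++;
        return (double) correct / truth.length;
    }

    // One CSV row of the kind the module would append per run:
    // algorithm, cluster count, accuracy, wall-clock millis (metric b).
    public static String csvRow(String algorithm, int k, double accuracy, long millis) {
        return String.format(Locale.ROOT, "%s,%d,%.4f,%d", algorithm, k, accuracy, millis);
    }
}
```

One such row per (dataset size, cluster count) pair is enough for Zeppelin to plot normalized time against the number of clusters.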




Regression
a) The runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a REST API to run the likelihood ratio test and return the results; this will once again be an async call through the spray/akka stack.
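For reference, once the two nested models' log-likelihoods are in hand, the likelihood ratio statistic itself is a one-liner (a sketch, not Mahout's implementation):

```java
// Likelihood ratio test statistic: D = 2 * (logLikFull - logLikReduced).
// Under the null hypothesis, D is asymptotically chi-square distributed
// with degrees of freedom equal to the number of extra features.
public class LikelihoodRatio {
    public static double statistic(double logLikReduced, double logLikFull) {
        if (logLikFull < logLikReduced)
            throw new IllegalArgumentException("the full model cannot fit worse than the nested one");
        return 2.0 * (logLikFull - logLikReduced);
    }
}
```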






b) The runtime performance module will capture the following metrics for every algorithm: 1) CPU usage, 2) memory usage, and 3) the time taken for the algorithm to converge and run to completion. These metrics will be reported on top of the Zeppelin graphs for both regression and the different clustering algorithms mentioned above.
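A hedged sketch of how metrics 1-3 could be captured from the JVM itself via java.lang.management, with no extra dependency (per-thread CPU time is a simplification; distributed Spark-backed runs would need more elaborate accounting):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Captures the three per-run metrics proposed for every algorithm.
public class RunMeter {
    public final long cpuTimeNanos;
    public final long heapUsedBytes;
    public final long wallClockMillis;

    private RunMeter(long cpu, long heap, long wall) {
        this.cpuTimeNanos = cpu;
        this.heapUsedBytes = heap;
        this.wallClockMillis = wall;
    }

    // Run `body` and record: CPU time of the current thread,
    // heap in use afterwards, and wall-clock time.
    public static RunMeter measure(Runnable body) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        body.run();
        return new RunMeter(
            threads.getCurrentThreadCpuTime() - cpuStart,
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed(),
            (System.nanoTime() - wallStart) / 1_000_000L);
    }
}
```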

How does the application get run? The runtime performance measuring application will be invoked from the command line; eventually it would be worthwhile to hook it into some sort of integration test suite to certify the different Mahout releases.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Saikat Kanjilal (JIRA)
2016-06-03 15:27:59 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saikat Kanjilal updated MAHOUT-1869:
------------------------------------
Description:
github repo is here: https://github.com/skanjila/mahout, will send pull request when I have 1 algorithm operational


Saikat Kanjilal (JIRA)
2016-06-05 21:07:59 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316019#comment-15316019 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Moved all the code to a new branch, mahout-1869, and renamed all of the spray sample code. Next steps will be to get the code to compile; added dependencies for spray/akka/org.json4s.
Saikat Kanjilal (JIRA)
2016-06-10 04:40:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323856#comment-15323856 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

OK, so after a bit of toiling I have some code compiling, at least for the newly minted perf module. I will now add some instrumentation for a simple naive Bayes implementation and some timers to time the overall run.
Saikat Kanjilal (JIRA)
2016-06-12 03:20:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326158#comment-15326158 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Got the code running, but am hitting a NullPointerException due to a null DistributedContext; researching how this gets created in the math-scala module so I can emulate that workflow.
Saikat Kanjilal (JIRA)
2016-06-21 05:12:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341111#comment-15341111 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Trying to compile the code with the bare minimum and decipher how the DistributedContext is getting defined by mimicking the code in math-scala; got the jar-with-dependencies assembly working in the Maven package target.
Post by Saikat Kanjilal (JIRA)
Create a runtime performance measuring framework for mahout
-----------------------------------------------------------
Key: MAHOUT-1869
URL: https://issues.apache.org/jira/browse/MAHOUT-1869
Project: Mahout
Issue Type: Story
Components: build, Classification, Collaborative Filtering, Math
Affects Versions: 1.0.0
Reporter: Saikat Kanjilal
Labels: build
Fix For: 1.0.0
Original Estimate: 1,008h
Remaining Estimate: 1,008h
This proposal will outline a runtime performance module used to measure the performance of various algorithms in mahout in the three major areas, clustering, regression and classification. The module will be a spray/scala/akka application which will be run by any current or new algorithm in mahout and will display a csv file and a set of zeppelin plots outlining the various criteria for performance. The goal of releasing any new build in mahout will be to run a set of tests for each of the algorithms to compare and contrast some benchmarks from one release to another.
github repo is here: https://github.com/skanjila/mahout, will send pull request when I have 1 algorithm operational
Architecture
The run time performance application will run on top of spray/scala and akka and will make async api calls into the various mahout algorithms to generate a cvs file containing data representing the run time performance measurement calculations for each algorithm of interest as well as a set of zeppelin plots for displaying some of these results. The spray scala architecture will leverage the zeppelin server to create the visualizations. The discussion below centers around two types of algorithms to be addressed by the application.
Clustering
a) A method to load and execute the run time perf module and takes as inputs the name of the algorithm (kmeans, fuzzy kmeans) and a location of a set of files containing various sizes of data sets
/algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40 and finally a set of values for the number of clusters to use for each of the different sizes of the datasets
The above API call will return a runId which the client program can then use to monitor the module
b) A method to monitor the application to ensure that its making progress towards generating the zeppelin plots
/monitor/runId=456
The above method will execute asynchronously by calling into the mahout kmeans (fuzzy kmeans) clustering implementations and will generate zeppelin plots showing the normalized time on the y axis and the number of clusters in the x axis. The spray/scala akka framework will allow the client application to receive a callback when the run time performance calculations are actually completed. For now the calculations for measuring run time performance will contain: a) the ratio of the number of points clustered correctly to the total number of points b) the total time taken for the algorithm to run . These items will be represented in separate zeppelin plots.
Regression
a) The runtime performance module will run the likelihood ratio test with a different set of features in every run . We will introduce a rest API to run the likelihood ratio test and return the results, this will once again be an sync call through the spray/akka stack.
b) The runtime performance module will record the following metrics for every algorithm: 1) CPU usage, 2) memory usage, and 3) time taken for the algorithm to converge and run to completion. These metrics will be reported on top of the Zeppelin graphs for both the regression and the clustering algorithms mentioned above.
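A minimal sketch of capturing those JVM-level metrics, using the standard java.lang.management beans (the RunMetrics holder and its fields are assumptions; a real harness would sample these around each algorithm invocation):

```java
import java.lang.management.ManagementFactory;

// Hedged sketch of per-run JVM metric capture (heap in use, CPU time of the
// measuring thread).
class RunMetrics {
    final long heapUsedBytes;
    final long cpuTimeNanos;

    RunMetrics(long heapUsedBytes, long cpuTimeNanos) {
        this.heapUsedBytes = heapUsedBytes;
        this.cpuTimeNanos = cpuTimeNanos;
    }

    static RunMetrics capture() {
        long heap = ManagementFactory.getMemoryMXBean()
                                     .getHeapMemoryUsage().getUsed();
        // May return -1 on JVMs without thread CPU time support
        long cpu = ManagementFactory.getThreadMXBean()
                                    .getCurrentThreadCpuTime();
        return new RunMetrics(heap, cpu);
    }
}
```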
How does the application get run? The runtime performance measuring application will be invoked from the command line; eventually it would be worthwhile to hook it into some sort of integration test suite to certify the different Mahout releases.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Saikat Kanjilal (JIRA)
2016-06-22 04:17:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343683#comment-15343683 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Got the code to build and measure the performance of two methods, ssvd and spca; the GitHub repo is here:

https://github.com/skanjila/mahout/tree/mahout-1869


Would love feedback before forging ahead and adding other methods. For now I won't add methods requiring a distributed context, since that's a complicated ball of wax; I will get to those in the next phase.
Saikat Kanjilal (JIRA)
2016-07-07 04:34:11 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365594#comment-15365594 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Added the ability to create a CSV file out of all the current timings; the code/pull request is here:


https://github.com/apache/mahout/pull/245
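A hypothetical sketch, in plain Java, of rendering collected per-algorithm timings as CSV in the spirit of that change (the class and method names are illustrative, not taken from the pull request):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Render a map of algorithm name -> elapsed milliseconds as a CSV string.
class TimingCsv {
    static String toCsv(Map<String, Long> timingsMillis) {
        StringBuilder sb = new StringBuilder("algorithm,millis\n");
        for (Map.Entry<String, Long> e : timingsMillis.entrySet()) {
            sb.append(e.getKey()).append(',').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

A LinkedHashMap keeps rows in insertion order, so the CSV lists algorithms in the order they were timed.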
Saikat Kanjilal (JIRA)
2016-08-28 04:50:21 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15442772#comment-15442772 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Moving on, I added a Zeppelin target to the perf project. The design goal is the following:
1) When the CSV file gets printed out, plug it directly into one or more of the following Zeppelin plots: pivot chart, XY graph.

TBD
Should pick the simplest algorithm to do this, something like logistic regression or naive Bayes; suggestions would be welcome here.
Andrew Musselman (JIRA)
2016-08-28 05:17:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15442806#comment-15442806 ]

Andrew Musselman commented on MAHOUT-1869:
------------------------------------------

Will take a look this weekend; thanks!
Saikat Kanjilal (JIRA)
2016-09-17 18:50:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499501#comment-15499501 ]

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Addressed CR comments and committed changes to the tip of mahout-1869; please take a look when you get a chance.
Saikat Kanjilal (JIRA)
2016-10-12 18:27:21 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saikat Kanjilal resolved MAHOUT-1869.
-------------------------------------
Resolution: Won't Fix

After talking to Andrew Musselman, I am closing this issue and focusing on bugs and documentation.
