Discussion:
[Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Saikat Kanjilal
2016-05-20 04:31:05 UTC
Permalink
This proposal outlines a runtime performance module used to measure the performance of various algorithms in Mahout in three major areas: clustering, regression, and classification. The module will be a Spray/Scala/Akka application that can be run against any current or new algorithm in Mahout and will produce a CSV file and a set of Zeppelin plots outlining the various performance criteria. The goal for any new Mahout build will be to run a set of tests for each of the algorithms to compare and contrast benchmarks from one release to another.


Architecture
The runtime performance application will run on top of Spray/Scala and Akka and will make async API calls into the various Mahout algorithms to generate a CSV file containing the runtime performance measurements for each algorithm of interest, as well as a set of Zeppelin plots for displaying some of these results. The Spray/Scala architecture will leverage the Zeppelin server to create the visualizations. The discussion below centers on two types of algorithms to be addressed by the application.


Clustering
The application will consist of a set of REST APIs to do the following:


a) A method to load and execute the runtime perf module, which takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing data sets of various sizes, and a set of values for the number of clusters to use for each dataset size:


/algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40


The above API call will return a runId, which the client program can then use to monitor the run.




b) A method to monitor the application to ensure that it is making progress towards generating the Zeppelin plots:
/monitor/runId=456




The above method will execute asynchronously by calling into the Mahout kmeans (or fuzzy kmeans) clustering implementations and will generate Zeppelin plots showing the normalized time on the y axis and the number of clusters on the x axis. The Spray/Akka framework will allow the client application to receive a callback when the runtime performance calculations are actually completed. For now the runtime performance measurements will comprise: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be represented in separate Zeppelin plots.
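To make the two endpoints concrete, here is a minimal sketch of what they might look like in spray-routing, assuming plain query parameters rather than the path-style key=value segments shown above. PerfService, the in-memory runs map, and the parameter names are illustrative assumptions, not an existing Mahout API:

import java.util.UUID
import scala.collection.concurrent.TrieMap
import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object PerfService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("mahout-perf")
  private val runs = TrieMap.empty[String, String]  // runId -> status

  startServer(interface = "localhost", port = 8080) {
    path("algorithm") {
      parameters('name, 'fileLocation, 'clusters) { (name, fileLocation, clusters) =>
        val runId = UUID.randomUUID().toString
        runs(runId) = "RUNNING"
        // kick off the benchmark asynchronously here (actor or Future),
        // updating runs(runId) when it completes
        complete(runId)
      }
    } ~
    path("monitor") {
      parameter('runId) { runId =>
        complete(runs.getOrElse(runId, "UNKNOWN"))
      }
    }
  }
}

A client would then call /algorithm?name=kmeans&fileLocation=/path/to/datasets&clusters=12,20,30,40 and poll /monitor?runId=... until the run completes.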




Regression
a) The runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a REST API to run the likelihood ratio test and return the results; this will once again be an async call through the Spray/Akka stack.






b) The runtime performance module will capture the following metrics for every algorithm: 1) CPU usage, 2) memory usage, and 3) time taken for the algorithm to converge and run to completion. These metrics will be reported on top of the Zeppelin graphs for both the regression and the different clustering algorithms mentioned above.
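As a rough illustration of how these three metrics could be captured around a single algorithm run on the JVM (a sketch assuming a HotSpot JVM; the PerfMetrics case class and measure helper are made up for this example):

import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

case class PerfMetrics(wallClockMs: Long, usedHeapBytes: Long, processCpuLoad: Double)

def measure[A](run: => A): (A, PerfMetrics) = {
  // HotSpot-specific bean; getProcessCpuLoad returns a value in [0, 1],
  // or a negative number if the load is not yet available
  val os = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]
  val rt = Runtime.getRuntime
  val t0 = System.nanoTime()
  val result = run                                      // execute the algorithm under test
  val wallMs = (System.nanoTime() - t0) / 1000000
  val usedHeap = rt.totalMemory() - rt.freeMemory()     // coarse snapshot after the run
  (result, PerfMetrics(wallMs, usedHeap, os.getProcessCpuLoad))
}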

How does the application get run?
The runtime performance measuring application will be invoked from the command line; eventually it would be worthwhile to hook it into some sort of integration test suite to certify the different Mahout releases.


I will add more thoughts around this and create a JIRA ticket once there's enough consensus among the committers that this is headed in the right direction. I will also add some more thoughts on measuring the runtime performance of some of the other algorithms after further research.
I would love feedback or additional things to consider that I might have missed. If it's more appropriate I can move the discussion to a JIRA ticket as well, so please let me know. Thanks in advance.
Saikat Kanjilal
2016-06-03 04:35:54 UTC
Permalink
Hi All, I've created a JIRA ticket and moved the discussion of the runtime performance framework there:
https://issues.apache.org/jira/browse/MAHOUT-1869
@AndrewP & Trevor: I would like to integrate Zeppelin into the runtime performance measurement framework to output some measurement-related data for some of the algorithms.
Should I wait until the Zeppelin integration is completely working before I incorporate this piece?
Also, I would really appreciate some feedback, either on the JIRA ticket or in response to this thread. Regards
Saikat Kanjilal
2016-06-06 15:58:49 UTC
Permalink
Andrew, thanks for the input. I will shift gears a bit and just get some lightweight code going that calls into Mahout algorithms and does a CSV dump. Note that I think Akka could be a good fit for this, as you could make an async call and get back a notification when the CSV dump is finished (a minimal sketch of that idea is below). Also, I am indeed not focusing on MapReduce algorithms and will be tackling the algorithms in the math-scala library. What do you think of making this a lightweight web-based workbench using Spray that committers can run outside of Mahout through curl or something? This was my initial vision in using Spray, and it's good that I'm getting early feedback.
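The async-notification idea needs nothing heavier than a plain Scala Future; runBenchmarkAndDumpCsv is a hypothetical placeholder:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// hypothetical placeholder: run the algorithms, write the CSV, return its path
def runBenchmarkAndDumpCsv(): String = "perf-results.csv"

val run = Future(runBenchmarkAndDumpCsv())
run.onComplete {
  case Success(path) => println(s"CSV dump finished: $path")
  case Failure(e)    => println(s"benchmark failed: ${e.getMessage}")
}
Await.ready(run, 10.minutes)  // keep the driver alive until the run completes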

On Zeppelin: do you think it's worthwhile for me to incorporate Trevor's efforts to take that CSV and turn it into one or two visualizations? I'm trying to understand how that effort may (or may not) intersect with what I'm trying to accomplish.
Also point taken on the small data sets.
Thanks
Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date: Mon, 6 Jun 2016 15:50:16 +0000
Saikat,
If you're going to pursue this, there are a few things that I would suggest. First, keep it lightweight. We don't want to bring a lot of extra dependencies or data into the distribution. I'm not sure what this means as far as Spray/Akka, but those seem like overkill in my opinion. This should be able to be kept down to a simple CSV dump, I think.
Second, use data that can be either randomly generated with a seeded RNG, generated by a function like Mackey-Glass, or downloaded (probably best), and only use a very small sample in the tests, since they're pretty long currently. The main point being that we don't want to ship any large test datasets with the distro.
Third, we're not using MapReduce anymore, so focus on algorithms in the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as matrix algebra operations. That is where I see this being useful, so that we may compare changes and optimizations going forward.
Thanks,
Andy
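As a concrete take on Andy's seeded-data suggestion, a small sketch using the deterministic random matrix views in mahout-math (this assumes the Matrices.uniformView helper and the scala bindings' cloned method from the 0.12-era API; worth verifying against the current source):

import org.apache.mahout.math.Matrices
import org.apache.mahout.math.scalabindings._
import RLikeOps._

// 500 x 20 matrix of uniform random values, fully determined by the seed,
// so nothing large needs to ship with the distribution
val a = Matrices.uniformView(500, 20, 1234).cloned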
Trevor Grant
2016-06-06 16:33:32 UTC
Permalink
I can only chime in to the visualization part,

If you output to a CSV, it can be easily consumed and visualized via Zeppelin.

Specifically, there should be an exposed function where a CSV (or even better a TSV) string is generated, which can then be used by a 'write to disk' method.

The TSV string can then be visualized in Zeppelin via the %table interface (which is Angular based, but sufficient for many benchmarking applications) or in R/Python (ggplot2, etc. / matplotlib).

The moral of the story being that the only thing needed to integrate with Zeppelin would be a *.tsv file as a string.
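To make that concrete, a minimal sketch of the TSV-string idea (the Row type, the numbers, and the file name are illustrative; %table is Zeppelin's standard table display, tab-separated columns and newline-separated rows with a header row):

import java.nio.file.{Files, Paths}

case class Row(algorithm: String, runTimeMs: Long)

def toTsv(rows: Seq[Row]): String =
  ("algorithm\trunTimeMs" +: rows.map(r => s"${r.algorithm}\t${r.runTimeMs}"))
    .mkString("\n")

val tsv = toTsv(Seq(Row("ssvd", 812), Row("spca", 1045)))
Files.write(Paths.get("perf-results.tsv"), tsv.getBytes("UTF-8"))  // the 'write to disk' method
println("%table\n" + tsv)  // in a Zeppelin paragraph, renders as a chartable table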

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Saikat Kanjilal
2016-06-06 17:12:59 UTC
Permalink
Perfect, thank you, that helps scope the effort a bit more accurately.
Saikat Kanjilal
2016-06-10 04:45:13 UTC
Permalink
Andrew et al., over the past few days I've finally gotten a self-contained module compiling that leverages the DistributedContext. For starters I copied the NaiveBayes test code, ripped out the test infrastructure code around it, and then added some timers; next steps will be to dump to CSV and eventually to Zeppelin. Some questions before I get too far ahead:
1) I made the design decision to create my own trait and encapsulate the context within that. I am wondering if I should instead leverage the context that is already defined in math-scala; this, however, brings its own complications in that it pulls in the MahoutSuite, which I'm not sure I really need. Thoughts on this?
2) I need some infrastructure to run the perf framework. I can use an Azure Ubuntu VM for now, but is there an AWS instance or some other VM I can eventually use? I would really like to avoid using my Mac laptop as a runtime perf testing environment.

Thanks, I'll update JIRA as I make more headway.
Saikat Kanjilal
2016-06-12 19:40:26 UTC
Permalink
Hi Folks, I need some input/help here to get me unblocked and moving:
1) I need to reuse/extend the DistributedContext inside the runtime perf measurement module, as all algorithms inside math-scala need it. I was trying to mimic some of the H2O code and saw that they had their own engine. I am wondering what the best way is to extend DistributedContext and get the benefit of an already existing engine without needing to tie into H2O or Flink, or is the only way to add an engine to point to one of those back ends? Ideally I want to build the runtime perf module in a backend-agnostic way, and currently I don't see a way around this. Thoughts?
2) I also tried to reuse some of the logic inside math-scala, but in digging into that code it seems it is strongly tied to the Scala test utilities.

Net-net: I just need access to the DistributedContext without linking in any test utilities or backends.
I would love some advice on ways to move forward that maximize reuse. Thanks in advance.
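For reference, the closest thing I've found is constructing the context from a specific binding's package object, with no test utilities involved. A sketch against the Spark bindings (signature as I recall it from the 0.12-era sparkbindings package object; worth verifying) -- the entry point is Spark-specific, but the rest of the perf code can stay engine-agnostic by depending only on DistributedContext:

import org.apache.mahout.math.drm.DistributedContext
import org.apache.mahout.sparkbindings._

// builds a SparkDistributedContext around a local Spark master
implicit val ctx: DistributedContext =
  mahoutSparkContext(masterUrl = "local[2]", appName = "mahout-perf")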
Saikat Kanjilal
2016-06-21 03:37:31 UTC
Permalink
AndrewP et al., any chance I can get some pointers on the items in my previous message? I would love some direction on this. Thanks
Saikat Kanjilal
2016-06-22 04:21:27 UTC
Permalink
Ok, so for now I am able to get around the issues above by working on code that measures performance times without requiring the notion of a DistributedContext. I have two methods that I am measuring performance times for: ssvd and spca. The GitHub repo is here:
https://github.com/skanjila/mahout/tree/mahout-1869
Please provide feedback, as I will now restructure/reorganize the code to add more methods and start work on a perf harness that spits out a report in CSV and then eventually ties into Zeppelin. A rough sketch of the shape of the timing code so far is below.
I've kept JIRA up to date as well.
Thanks in advance.
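The timing code looks roughly like this (a sketch: the matrix size and k/p/q values are arbitrary examples, and the in-core ssvd/spca signatures are from math-scala's decompositions package as I understand it, so worth double-checking):

import org.apache.mahout.math.Matrices
import org.apache.mahout.math.scalabindings._
import RLikeOps._
import org.apache.mahout.math.decompositions._

// seeded in-core test matrix, no DistributedContext required
val a = Matrices.uniformView(300, 50, 1234).cloned

def timeMs[A](body: => A): Long = {
  val t0 = System.nanoTime()
  body
  (System.nanoTime() - t0) / 1000000
}

println(s"ssvd: ${timeMs(ssvd(a, k = 10, p = 15, q = 0))} ms")
println(s"spca: ${timeMs(spca(a, k = 10, p = 15, q = 0))} ms")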
Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date: Mon, 20 Jun 2016 20:37:31 -0700
AndrewP et al,Any chance I can get some pointers on the items below, would love some direction on this.Thanks
Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date: Sun, 12 Jun 2016 12:40:26 -0700
1) I need to reuse/extend the DistributedContext inside the runtime perf measurement module, as all algorithms inside math-scala need this. I was trying to mimic some of the H2O code and saw that they had their own engine. I am wondering what the best way is to extend DistributedContext and get the benefit of an already existing engine without needing to tie into h2o or flink, or is the only way to add an engine to point to one of those backends? Ideally I want to build the runtime perf module in a backend-agnostic way, and currently I don't see a way around this. Thoughts?
2) I also tried to reuse some of the logic inside math-scala, but in digging into this code it seems that it is strongly tied to scala test utilities.
Net-net: I just need access to the DistributedContext without linking in any test utilities or backends.
Would love some advice on ways to move forward to maximize reuse. Thanks in advance.
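One possible shape for the backend-agnostic access in (1), assuming org.apache.mahout.math.drm.DistributedContext is the context type the math-scala algorithms need; PerfContextProvider and PerfHarness are illustrative names, not existing mahout classes.

import org.apache.mahout.math.drm.DistributedContext

// The perf module codes only against this trait; each backend (spark,
// h2o, flink) supplies its own provider, so no backend is linked in here.
trait PerfContextProvider {
  def withContext[T](body: DistributedContext => T): T
}

class PerfHarness(provider: PerfContextProvider) {
  def run(job: DistributedContext => Unit): Unit =
    provider.withContext(job)
}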
Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date: Thu, 9 Jun 2016 21:45:13 -0700
Andrew et al, so over the past few days I've finally got a self contained module compiling that leverages the DistributedContext; for starters I copied the NaiveBayes test code, ripped out the test infrastructure code around it and then added some timers. Next steps will be to dump to csv and eventually to zeppelin. Some questions before I get there:
1) I made the design decision to create my own trait and encapsulate the context within that. I am wondering if I should instead leverage the context that is already defined in math-scala; this however brings its own complications in that it pulls in the MahoutSuite, which I'm not sure I really need. Thoughts on this?
2) I need some infrastructure to run the perf framework. I can use an azure ubuntu vm for now, but is there an AWS instance or some other vm I can eventually use? I would really like to avoid using my mac laptop as a runtime perf testing environment.
Thanks, I'll update JIRA as I make more headway.
Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date: Mon, 6 Jun 2016 08:58:49 -0700
Andrew, thanks for the input. I will shift gears a bit and just get some lightweight code going that calls into mahout algorithms and does a csv dump out. Note that I think akka could be a good fit for this, as you could make an async call and get back a notification when the csv dump is finished. Also, I am indeed not focusing on mapreduce algorithms and will be tackling the algorithms in the math-scala library. What do you think of making this a lightweight web based workbench using spray that committers can run outside of mahout through curl or something? This was my initial vision in using spray, and it's good that I'm getting early feedback.
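The async-call-plus-notification flow can be sketched without pulling in akka at all, e.g. with plain Scala futures; csvDump below is a hypothetical stand-in for whatever writes the report.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

object AsyncDump {
  // Kick off the dump on a worker thread and notify the caller on completion.
  def dumpAsync(csvDump: () => Unit): Future[Unit] = {
    val f = Future(csvDump())
    f.onComplete {
      case Success(_) => println("csv dump finished")
      case Failure(e) => println(s"csv dump failed: ${e.getMessage}")
    }
    f
  }
}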
On zeppelin, do you think it's worthwhile that I incorporate Trevor's efforts to take that csv and turn it into one or two visualizations? I'm trying to understand how that effort may (or may not) intersect with what I'm trying to accomplish.
Also point taken on the small data sets.
Thanks
Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date: Mon, 6 Jun 2016 15:50:16 +0000
Saikat,
If you're going to pursue this, there are a few things that I would suggest. First, keep it lightweight. We don't want to bring a lot of extra dependencies or data into the distribution. I'm not sure what this means as far as spray/akka, but those seem like overkill in my opinion. This should be able to be kept down to a simple csv dump, I think.
Second, use data that can be either randomly generated with a seeded RNG, or a function like Mackey-Glass, or downloaded (probably best), and only use a very small sample in the tests, since they're pretty long currently. The main point being that we don't want to ship any large test datasets with the distro.
Third, we're not using MapReduce anymore, so focus on algorithms in the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as matrix algebra operations. That is where I see this being useful, so that we may compare changes and optimizations going forward.
Thanks,
Andy
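(For concreteness on the seeded-RNG point: mahout-math's deterministic matrix views can generate reproducible input without shipping any dataset. A minimal sketch, assuming Matrices.symmetricUniformView from org.apache.mahout.math with its (rows, columns, seed) signature:)

import org.apache.mahout.math.{Matrices, Matrix}

object PerfData {
  // Same (rows, cols, seed) always yields the same matrix, so runs are
  // reproducible across releases and no test data ships with the distro.
  def syntheticInput(rows: Int, cols: Int, seed: Int): Matrix =
    Matrices.symmetricUniformView(rows, cols, seed)
}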
________________________________________
Sent: Friday, June 3, 2016 12:35:54 AM
Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Hi All, created a JIRA ticket and have moved the discussion for this there:
https://issues.apache.org/jira/browse/MAHOUT-1869
@AndrewP & Trevor, I would like to integrate zeppelin into the runtime performance measurement framework to output some measurement related data for some of the algorithms.
Should I wait till the zeppelin integration is completely working before I incorporate this piece?
Also, I would really appreciate some feedback, either on the JIRA ticket or in response to this thread. Regards
Saikat Kanjilal
2016-07-07 04:37:48 UTC
Permalink
Ok folks, I've created a pull request here for a barebones runtime performance measurement framework that:

1) measures two simple timings from ssvd and spca

2) dumps these timings into a csv file
https://github.com/apache/mahout/pull/245

I'd greatly appreciate some early feedback on the design before moving forward, so that I don't make too many more changes without getting something included. I will move ahead with the zeppelin integration in a few days and reorganize the code a bit to include all the perf related pieces into one class or trait, along the lines of the sketch below.
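A sketch of that reorganization (illustrative only, not the layout of the actual pull request): the timing and csv pieces sit behind a single trait, and each measured algorithm mixes it in.

import java.io.{File, FileWriter, PrintWriter}

trait PerfMeasured {
  def reportFile: File

  // Run body once, time it, and append a "name,elapsedMs" row to the report.
  def measured(name: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    val ms = (System.nanoTime() - start) / 1e6
    val out = new PrintWriter(new FileWriter(reportFile, true))
    try out.println(f"$name%s,$ms%.2f")
    finally out.close()
  }
}

A hypothetical SsvdTimings class would then mix this in and wrap the algorithm invocation in measured("ssvd") { ... }.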


Thanks in advance for your help.


Andrew Musselman
2016-07-07 15:15:24 UTC
Permalink
Excellent, thanks Saikat; I'll be able to take a look over the weekend.
Saikat Kanjilal
2016-07-13 05:56:27 UTC
Permalink
Andrew,
Ping on this; let me know your thoughts on my pull request.
Thanks

Saikat Kanjilal
2016-07-21 02:55:21 UTC
Permalink
Hello Mahout Committers,
I would like to follow up on this, and am eager to get traction/feedback on my pull request so far. I would love to get an initial implementation working in the next month, so I look forward to hearing from folks who would be interested in using this.

Thanks in advance.

Saikat Kanjilal
2016-08-05 16:02:52 UTC
Permalink
AndrewM/AndrewP and others that may be interested,

I am going to move ahead with the next additions to the runtime performance measurement framework, as I have not heard back from either of you on my previous requests. This includes:

1) Organizing the code a bit more so that the driver collects the data into csv files in a pre-created directory (see the sketch after this list)

2) Adding the zeppelin interface so that one can view the pre and post effects of making changes to a set of hyperparameters for each of the algorithms
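A minimal sketch of item 1, assuming the driver owns an output directory and writes one csv per algorithm; PerfReport and the column names are illustrative, not the code in the pull request.

import java.io.{File, FileWriter, PrintWriter}
import java.nio.file.{Files, Paths}

object PerfReport {
  // Pre-create outDir, then write one csv per algorithm, one metric per row.
  def writeReport(outDir: String, algorithm: String, rows: Seq[(String, Double)]): File = {
    Files.createDirectories(Paths.get(outDir))
    val report = new File(outDir, s"$algorithm.csv")
    val out = new PrintWriter(new FileWriter(report))
    try {
      out.println("metric,value")
      rows.foreach { case (metric, value) => out.println(f"$metric%s,$value%.2f") }
    } finally out.close()
    report
  }
}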



Please respond and/or look at my github pull request; I will assume that you are in general agreement with the direction this is headed if I don't hear back otherwise. Also, please do share your thoughts if you see additional things I should think about putting in.



Thanks in advance.


Post by Saikat Kanjilal
Andrew,
Ping on this , let me know your thoughts on my pull request.
Thanks
Sent from my iPad
Post by Andrew Musselman
Excellent, thanks Saikat; I'll be able to take a look over the weekend.
Post by Saikat Kanjilal
Ok folks I've created a pull request here for a barebones runtime
1) measures two simple timings from ssvd and spca
2) dumps these timings into a csv file
https://github.com/apache/mahout/pull/245
[https://avatars0.githubusercontent.com/u/674374?v=3&s=400]<https://github.com/apache/mahout/pull/245>

Mahout 1869 by skanjila · Pull Request #245 · apache/mahout<https://github.com/apache/mahout/pull/245>
github.com
Added the ability to dump output to csv file
Post by Saikat Kanjilal
Post by Andrew Musselman
Post by Saikat Kanjilal
[https://avatars0.githubusercontent.com/u/674374?v=3&s=400]<
https://github.com/apache/mahout/pull/245>
Mahout 1869 by skanjila · Pull Request #245 · apache/mahout<
https://github.com/apache/mahout/pull/245>
github.com
Added the ability to dump output to csv file
I'd be greatly appreciative if I can get some early feedback on design
before moving forward and making too many more changes and not getting
something included. I will move ahead with the zeppelin integration in a
few days and reorganize the code a bit to include all the perf related
pieces into one class or trait.
Thanks in advance for your help.
Thanks in advance.
________________________________
Sent: Tuesday, June 21, 2016 9:21 PM
Subject: RE: [Discuss--A proposal for building an application in mahout to
measure runtime performance of algorithms in mahout]
Ok, so for now I am able to get around the issues bwlow by working on code
to measure performance times not requiring the notion of a
DIstributedContext to get this up and running, I have two methods that I am
https://github.com/skanjila/mahout/tree/mahout-1869
[https://avatars0.githubusercontent.com/u/674374?v=3&s=400]<
https://github.com/skanjila/mahout/tree/mahout-1869>
skanjila/mahout<https://github.com/skanjila/mahout/tree/mahout-1869>
github.com
mahout - Mirror of Apache Mahout
Please provide feedback, as I will now restructure/reorganize the code to add more methods and start work on a perf harness that spits out a csv report and eventually ties into zeppelin.
I've kept JIRA up to date as well.
Thanks in advance.
Subject: RE: [Discuss--A proposal for building an application in mahout
to measure runtime performance of algorithms in mahout]
Date: Mon, 20 Jun 2016 20:37:31 -0700
AndrewP et al, any chance I can get some pointers on the items below? Would love some direction on this. Thanks
Subject: RE: [Discuss--A proposal for building an application in
mahout to measure runtime performance of algorithms in mahout]
Date: Sun, 12 Jun 2016 12:40:26 -0700
1) I need to reuse/extend the DistributedContext inside the runtime perf measurement module, as all algorithms inside math-scala need it. I was trying to mimic some of the H2O code and saw that they had their own engine. I am wondering what the best way is to extend DistributedContext and get the benefit of an already existing engine without needing to tie into h2o or flink, or is the only way to add an engine to point at one of those back ends? Ideally I want to build the runtime perf module in a backend-agnostic way, and currently I don't see a way around this. Thoughts?
2) I also tried to reuse some of the logic inside math-scala, but in digging into that code it seems it is strongly tied to scala test utilities.
Net-net: I just need access to the DistributedContext without linking in any test utilities or backends.
Would love some advice on ways to move forward to maximize reuse. Thanks in advance.
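
For what it's worth, here is one backend-agnostic shape this could take: a minimal sketch only, assuming the harness is written purely against the abstract DistributedContext from math-scala. PerfHarness, timed and run are made-up names, not existing mahout API:

    import org.apache.mahout.math.drm.DistributedContext

    // Hypothetical harness trait: it only ever sees the abstract context, so
    // the choice of spark/flink/h2o backend is deferred entirely to the caller.
    trait PerfHarness {

      /** Times one run of `body`, returning (label, elapsed milliseconds). */
      def timed[T](label: String)(body: => T): (String, Long) = {
        val start = System.nanoTime()
        body
        (label, (System.nanoTime() - start) / 1000000L)
      }

      /** Runs each named algorithm against the implicitly supplied context. */
      def run(algorithms: Seq[(String, DistributedContext => Unit)])
             (implicit ctx: DistributedContext): Seq[(String, Long)] =
        algorithms.map { case (name, algo) => timed(name)(algo(ctx)) }
    }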
Subject: RE: [Discuss--A proposal for building an application in
mahout to measure runtime performance of algorithms in mahout]
Date: Thu, 9 Jun 2016 21:45:13 -0700
Andrew et al, over the past few days I've finally gotten a self-contained module compiling that leverages the DistributedContext. For starters I copied the NaiveBayes test code, ripped out the test infrastructure code around it, and then added some timers; next steps will be to dump to csv and eventually to zeppelin. Some questions before I go further:
1) I made the design decision to create my own trait and encapsulate the context within that. I am wondering if I should instead leverage the context that is already defined in math-scala; this however brings its own complications in that it pulls in the MahoutSuite, which I'm not sure I really need. Thoughts on this?
2) I need some infrastructure to run the perf framework. I can use an azure ubuntu vm for now, but is there an AWS instance or some other vm I can eventually use? I would really like to avoid using my mac laptop as a runtime perf testing environment.
Thanks, I'll update JIRA as I make more headway.
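
As a concrete illustration of the dump-to-csv step mentioned above, a sketch only, not the code in the branch; CsvDump and the file naming are invented for the example:

    import java.io.{File, PrintWriter}

    // Hypothetical csv writer: one row per timed algorithm, written into a
    // pre-created output directory so the driver can collect reports per run.
    object CsvDump {
      def write(outDir: File, runId: String, timings: Seq[(String, Long)]): File = {
        require(outDir.isDirectory, s"expected pre-created directory: $outDir")
        val out = new File(outDir, s"perf-run-$runId.csv")
        val pw = new PrintWriter(out)
        try {
          pw.println("algorithm,elapsedMillis")
          timings.foreach { case (name, ms) => pw.println(s"$name,$ms") }
        } finally pw.close()
        out
      }
    }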
Subject: RE: [Discuss--A proposal for building an application in
mahout to measure runtime performance of algorithms in mahout]
Date: Mon, 6 Jun 2016 08:58:49 -0700
Andrew, thanks for the input. I will shift gears a bit and just get some lightweight code going that calls into mahout algorithms and dumps out a csv. Note that I think akka could be a good fit for this, as you could make an async call and get back a notification when the csv dump is finished; a rough sketch of that pattern follows below. Also, I am indeed not focusing on mapreduce algorithms and will be tackling the algorithms in the math-scala library. What do you think of making this a lightweight web-based workbench using spray that committers can run outside of mahout through curl or something? This was my initial vision in using spray, and it's good that I'm getting early feedback.
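
To illustrate that async-call-plus-notification idea, here is a sketch using plain scala Futures (an akka actor would work similarly). AsyncPerfRun and runAndDump are invented names, and it reuses the hypothetical CsvDump writer sketched above:

    import java.io.File
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    object AsyncPerfRun {
      /** Runs the perf measurement off the calling thread; the caller gets
        * notified via the callback below once the csv dump has finished. */
      def runAndDump(runId: String, outDir: File)
                    (perfRun: () => Seq[(String, Long)]): Unit =
        Future {
          val timings = perfRun()                // call into the mahout algorithm(s)
          CsvDump.write(outDir, runId, timings)  // hypothetical writer sketched above
        }.onComplete {
          case Success(csv) => println(s"run $runId finished, report at $csv")
          case Failure(e)   => println(s"run $runId failed: ${e.getMessage}")
        }
    }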
On zeppelin, do you think it's worthwhile to incorporate Trevor's efforts to take that csv and turn it into one or two visualizations? I'm trying to understand how that effort may (or may not) intersect with what I'm trying to accomplish.
Also point taken on the small data sets.
Thanks
Subject: Re: [Discuss--A proposal for building an application in
mahout to measure runtime performance of algorithms in mahout]
Date: Mon, 6 Jun 2016 15:50:16 +0000
Saikat,
If you're going to pursue this, there are a few things that I would suggest. First, keep it lightweight. We don't want to bring a lot of extra dependencies or data into the distribution. I'm not sure what this means as far as spray/akka, but those seem like overkill in my opinion; I think this can be kept down to a simple csv dump.
Second, use data that can be either randomly generated with a seeded RNG, generated by a function like Mackey-Glass, or downloaded (probably best), and only use a very small sample in the tests, since they're pretty long currently. The main point is that we don't want to ship any large test datasets with the distro (a sketch of this follows below).
Third, we're not using MapReduce anymore, so focus on algorithms in the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as matrix algebra operations. That is where I see this being useful, so that we may compare changes and optimizations going forward.
Thanks,
Andy
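
A minimal sketch of the reproducible-data suggestion in the second point above: a seeded RNG plus an Euler-discretized Mackey-Glass series, so nothing large ships with the distro. The object and method names are invented; the Mackey-Glass constants (beta = 0.2, gamma = 0.1, n = 10, tau = 17) are the conventional ones:

    import scala.util.Random

    object SyntheticData {

      /** Seeded RNG: the same rows come back on every run for a given seed. */
      def randomRows(seed: Long, rows: Int, cols: Int): Array[Array[Double]] = {
        val rng = new Random(seed)
        Array.fill(rows, cols)(rng.nextDouble())
      }

      /** Euler-discretized Mackey-Glass series of length `len`:
        *   x(t+1) = x(t) + beta * x(t-tau) / (1 + x(t-tau)^n) - gamma * x(t)
        */
      def mackeyGlass(len: Int, tau: Int = 17, beta: Double = 0.2,
                      gamma: Double = 0.1, n: Double = 10.0): Array[Double] = {
        val x = Array.fill(len + tau)(1.1)   // constant warm-up history
        for (t <- tau until (len + tau - 1)) {
          val xTau = x(t - tau)
          x(t + 1) = x(t) + beta * xTau / (1.0 + math.pow(xTau, n)) - gamma * x(t)
        }
        x.drop(tau)                          // discard the warm-up prefix
      }
    }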
________________________________________
Sent: Friday, June 3, 2016 12:35:54 AM
Subject: RE: [Discuss--A proposal for building an application in
mahout to measure runtime performance of algorithms in mahout]
Hi All, I created a JIRA ticket and have moved the discussion here:
https://issues.apache.org/jira/browse/MAHOUT-1869
@AndrewP & Trevor, I would like to integrate zeppelin into the runtime performance measurement framework to output some measurement-related data for some of the algorithms. Should I wait till the zeppelin integration is completely working before I incorporate this piece?
Also, I would really appreciate some feedback, either on the JIRA ticket or in response to this thread. Regards
Subject: [Discuss--A proposal for building an application in
mahout to measure runtime performance of algorithms in mahout]
Date: Thu, 19 May 2016 21:31:05 -0700
I will add more thoughts around this and create a JIRA ticket only once there's enough consensus among the committers that this is headed in the right direction. I will also add some more thoughts on measuring run time performance of some of the other algorithms after some more research.
Would love feedback or additional things to consider that I might have missed. If it's more appropriate I can move the discussion to a jira ticket as well, so please let me know. Thanks in advance.