Discussion:
Future Mahout - Zeppelin work
Pat Ferrel
2016-05-16 23:42:42 UTC
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories, IIRC. There is also the Spark conf setting that allocates memory for the Kryo buffer. Look at the mc creation code in the Spark package helpers. All of it can be done in straight Spark and passed in to create the mc when needed. Again, this is from old weak brain cells, but I think that, plus the imports, is part of what makes the Mahout shell different from the Spark shell: it auto-creates the mc instead of, or along with, an sc.

When I get back to my computer I can check.
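
For reference, a minimal sketch of what "done in straight Spark" might look like, written as plain key/value pairs so it runs without Spark on the classpath. The property names are standard Spark settings and the registrator class is the one from Mahout's Spark bindings; the buffer size and the names `mahoutKryoProps` and `toSparkDefaults` are illustrative assumptions, not values taken from the Mahout helpers.

```scala
// Kryo-related Spark properties a Mahout context is assumed to need.
// The buffer size is a placeholder; tune it for your data.
val mahoutKryoProps: Seq[(String, String)] = Seq(
  "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
  "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
  "spark.kryoserializer.buffer" -> "64m"
)

// Hypothetical helper: render the pairs as spark-defaults.conf lines ("key value").
def toSparkDefaults(props: Seq[(String, String)]): String =
  props.map { case (key, value) => s"$key $value" }.mkString("\n")

println(toSparkDefaults(mahoutKryoProps))
```

With Spark available, the same pairs could be applied with `new SparkConf().setAll(mahoutKryoProps)` before creating the SparkContext that backs the mc.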

On May 16, 2016, at 3:40 PM, Andrew Palumbo <***@outlook.com> wrote:

Trevor,

Could you post any kryo errors that you may be having?

________________________________
From: Andrew Palumbo <***@outlook.com>
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work




To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots are at this point really just a POC, but at some point we may want to integrate some data transformation features into the Mahout plots classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds great.


Thank you.

________________________________
From: Trevor Grant <***@gmail.com>
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work

I just signed up for dev, should i just reply all and cc dev or start a new thread?

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

"Fortunate is he, who is able to know the causes of things." -Virgil


On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <***@gmail.com<mailto:***@gmail.com>> wrote:
fwiw ggplot2 is pretty darn advanced :) I am a bit skeptical that Smile would have something that ggplot2 would not; the other way around is much more expected by me :)

anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.

On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <***@outlook.com<mailto:***@outlook.com>> wrote:

yeah we should probably move this over to dev@


sorry- answering a question from a couple emails back on the thread.


If possible, I think it would be great to eventually have both (native Mahout/Smile plots and ggplot), since in the future we're going to be adding more visualization features, beyond simple scatter plots etc., that may not be covered by ggplot.


That's why we were thinking about using angular and the pngs.


But what you're saying in your last email would be great!


Thank you!


________________________________
From: Trevor Grant <***@gmail.com<mailto:***@gmail.com>>
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov

Subject: Re: Intro - Future Mahout - Zeppelin work

I somehow replied to your last email without seeing it...

OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).

I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.



On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <***@gmail.com<mailto:***@gmail.com>> wrote:
Sorry for the double email, but are you thinking visualization should be a library internal to Mahout, or should we leverage Zeppelin's visualization capabilities?

Also, should we move this discussion to dev?

tg




On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <***@outlook.com<mailto:***@outlook.com>> wrote:

Sorry- to be a little more clear, part of what we're trying to do is get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.


________________________________
From: Andrew Palumbo <***@outlook.com<mailto:***@outlook.com>>
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work


Awesome!


Most of the hard work was done by Dmitriy; I've just reworked it a couple of times to keep up with Spark's refactoring.


I think that you will also need to include:


mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar


For the new plotting features that we're working on.


The plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are Swing-based and can currently be exported as PNGs. There are a few examples on the closed PR: https://github.com/apache/mahout/pull/230


There is an example script in examples/bin/spark-shell-plot.mscala (committed to master): https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala


Thanks!



________________________________
From: Pat Ferrel <***@occamsmachete.com<mailto:***@occamsmachete.com>>
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work

This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is swing based but can output pngs, maybe other image formats—Andy?

BTW Dmitriy is still very involved but has trouble getting permission to donate code.


On May 16, 2016, at 1:45 PM, Trevor Grant <***@gmail.com<mailto:***@gmail.com>> wrote:

Hey Andrew,

thanks- you basically did all of the hard work for me!

I've got the linear regression example working from: http://mahout.apache.org/users/sparkbindings/play-with-shell.html

my java is sketchy at best, i tend to over import. I pulled in the following jars:
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar

I think those are all necessary... should I be pulling in more?

I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- eg why not let the user add it themselves... Add the appropriate maven artifacts, restart the interpreter and run the following in a notebook:
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...

That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.

Second question, do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.

Which leads back to the question, is this even worth building a full interpreter for or just make a really nice blog post with examples on how to integrate with R...?










On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <***@outlook.com<mailto:***@outlook.com>> wrote:
Hi Trevor, welcome!

It's great to have you helping out, thanks very much. I've done a good amount of work on our Mahout Spark shell, so let me know if you have any questions about what we did there.

Thanks a lot!

Andy


-------- Original message --------
From: Suneel Marthi <***@apache.org<mailto:***@apache.org>>
Date: 05/16/2016 2:44 PM (GMT-05:00)
To: Trevor Grant <***@gmail.com<mailto:***@gmail.com>>
Cc: Suneel Marthi <***@apache.org<mailto:***@apache.org>>, Pat Ferrel <***@occamsmachete.com<mailto:***@occamsmachete.com>>, Andrew Palumbo <***@outlook.com<mailto:***@outlook.com>>
Subject: Re: Intro - Future Mahout - Zeppelin work

Oh yes, he's around. I see him online.

On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <***@gmail.com<mailto:***@gmail.com>> wrote:
Is Dmitriy Lyubimov still around?

Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)

https://issues.apache.org/jira/browse/ZEPPELIN-116


tg




On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <***@apache.org<mailto:***@apache.org>> wrote:
Welcome to the party TG !!

On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <***@gmail.com<mailto:***@gmail.com>> wrote:
Hey all,

I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.

I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).

I'll work on getting logged in to the slack channel as well.

Nice to meet you all, looking forward to helping out!

tg




On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <***@apache.org<mailto:***@apache.org>> wrote:
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.

On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <***@occamsmachete.com<mailto:***@occamsmachete.com>> wrote:
Hey Trevor,

Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.

In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.

So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.

Pat


On May 15, 2016, at 9:52 AM, Suneel Marthi <***@apache.org<mailto:***@apache.org>> wrote:

Hi Trevor,

Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).

As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).

We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.

I sent u an invite to the Mahout slack channel, mahout.apache.org<http://mahout.apache.org/> - that's where we all hang out without having to worry about avoiding naughty words.

Looking forward to working with you

Suneel
Trevor Grant
2016-05-17 01:18:14 UTC
As a quick recap- we're trying to leverage Zeppelin for charting.

It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph

All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties; please look over and tell me if I'm
theoretically missing anything.

The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) Use Zeppelin's resource pools to pass the object
3) Collect the object in an R paragraph, convert it to a dataframe, then plot
using ggplot
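
Step 1 above can be sketched in plain Scala. This is only an illustration under stated assumptions: the matrix is stood in for by nested `Seq[Double]` rows rather than a Mahout `Matrix`, `matrixToCsv` is a hypothetical helper, and CSV is just one convenient format that R can parse with `read.csv(text = ...)`.

```scala
// Hypothetical helper: flatten matrix rows into CSV text with a header line.
// Nested Seq[Double] rows stand in for an in-core Mahout matrix here.
def matrixToCsv(rows: Seq[Seq[Double]], header: Seq[String]): String = {
  require(rows.forall(_.length == header.length),
    "every row must have as many values as the header has names")
  val lines = header.mkString(",") +: rows.map(_.mkString(","))
  lines.mkString("\n")
}

// Two rows, two named columns.
val csv = matrixToCsv(Seq(Seq(1.0, 2.0), Seq(3.0, 4.5)), Seq("x", "y"))
println(csv)
```

In a Zeppelin Spark paragraph the resulting string could then be handed to the resource pool with `z.put("mahoutMatrix", csv)` (the key name is made up) and read back from an R paragraph.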

Once I have a working prototype I will add some syntactic sugar to
prepare the matrix from the Scala side and pass it to Zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.

Once it's in a dataframe in R, go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
Python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5).

None of this necessarily requires changing the Zeppelin source code, and it
isn't very intrusive or difficult to set up. I'll make a blog post, but it's
almost a textbook entry-level tutorial on using imports in Zeppelin (i.e. a
tutorial would be just as at home on the Zeppelin site as it would on the
Mahout site).

Now, there has been some talk of using Zeppelin's AngularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make Zeppelin recognize matrices as tables and expose all of the
built-in charting features of Zeppelin.
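
For context, Zeppelin's display system already renders any paragraph output that starts with `%table` as a table (tab-separated columns, newline-separated rows, first row as header), which is what unlocks the built-in charts. A minimal sketch of producing that output by hand, with nested `Seq[Double]` rows standing in for a Mahout matrix and `toZeppelinTable` as a hypothetical helper name:

```scala
// Hypothetical helper: format rows as Zeppelin %table output.
// Columns are separated by tabs, rows by newlines; the first row is the header.
def toZeppelinTable(header: Seq[String], rows: Seq[Seq[Double]]): String = {
  val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
  "%table " + lines.mkString("\n")
}

// Printing this from a Spark paragraph should show a table with chart buttons.
println(toZeppelinTable(Seq("x", "y"), Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))))
```

This needs no build profile, but it only suits data small enough to collect to the driver.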

If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point, however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.

Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...). If
anything seems redundant or missing, please call it out.

Add Properties to Spark interp:

spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer

Add artifacts (need to change these to maven not local, also need to
add/change one jar per below, however this does run):

/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar

Add following code to first paragraph of notebook:
```
%spark
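// Mahout in-core math types and their Scala bindings
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
// Distributed row matrix (DRM) API
import org.apache.mahout.math.drm._
// R-like operators for in-core and distributed matrices
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
// Spark backend for the DRM API
import org.apache.mahout.sparkbindings._

// Wrap Zeppelin's SparkContext (sc) as the implicit Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
  sc2sdc(sc)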
```



Pat Ferrel
2016-05-17 18:17:24 UTC
Seems like there is plenty to use in ggplot or Python, but the pipeline is a little convoluted (so maybe there is no need for Angular integration). To get graphics out of Mahout it would be nice not to require knowledge of R and/or Python. Knowing Mahout is already bad enough, but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What is installed and how it is all set up is the next question.

BTW this is what I use elsewhere (Mahout as a lib to this code)

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",

afaik you will only see whether Kryo is working when you have to serialize a Mahout-specific data type like a vector or DRM, something registered with Kryo.


On May 16, 2016, at 6:18 PM, Trevor Grant <***@gmail.com> wrote:

As a quick recap- we're trying to leverage Zeppelin for charting.

It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph

All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Lower I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.

The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot

Once I have a working prototype I will work add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.

Once its in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)

All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up, I'll make a blog
post but its almost a text book entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).

Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more harry in that case, but we could make an optional build profile
that would make zeppelin recognize matrices at tables and expose all of the
built in charting features of Zeppelin.

If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark pass it to AngularJS then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more than it would be to simply pass to R or Python and let
ggplot or matlibplot do the work for you.

Finally, I haven't run into any errors yet using Kyro (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.

Add Properties to Spark interp:

spark.kryo.registrator
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer

Add artifacts (need to change these to maven not local, also need to
add/change one jar per below, however this does run):

/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar

Add following code to first paragraph of notebook:
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
sc2sdc(sc)
```



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor Grant
2016-05-18 14:47:21 UTC
Permalink
I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2

https://github.com/rawkintrevo/mahout-zeppelin

Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
into Zeppelin:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

So my thoughts on next steps, which I'm positing only as a starting point
for discussion, and are in no particular order of importance:

- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
  enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
  string (with some sanity, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like,
  e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first
    and foremost a data science tool for non-technical users.
  - If we go that route I'll need some more support finding out what is the
    absolute minimum 'bare-bones' Mahout we can include, e.g. does the user
    have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> paragraph showing how to do the same
  thing in Python.

The basic deal here is we are:
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout
   interpreter
   - This is taken care of by setting some env. variables, adding some
     dependencies, and importing relevant packages
2) Doing Mahout things as you do
3) Exporting the table to a tsv string, which is passed to a resource pool
   - This could be done to a disk if you didn't have Zeppelin
4) Reading the tsv from the resource pool (or disk if you didn't have
   Zeppelin) in R (Python soon) and creating a <plot package of your choice>
   plot

To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
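
The "table to tsv string" step could look roughly like the following. This is only a sketch in plain Scala: it works on rows that have already been collected into ordinary sequences, it takes the first `maxRows` rows rather than a random sample, and `MatrixTsv`, `toTsv`, and `maxRows` are hypothetical names, not anything that exists in Mahout.

```scala
object MatrixTsv {
  /** Render a header plus at most `maxRows` rows as one tab-separated string.
    * Capping the row count is the "some sanity" guard: a large matrix should
    * not be pushed through a notebook resource pool in full.
    */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]], maxRows: Int = 1000): String = {
    require(rows.forall(_.length == header.length), "every row must match the header width")
    val body = rows.take(maxRows).map(_.mkString("\t"))
    (header.mkString("\t") +: body).mkString("\n")
  }
}
```

The resulting string is what would be handed to the resource pool (or written to disk), and a tab-separated reader on the R or Python side can then turn it back into a data frame.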


Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see whether Kryo is working when you have to serialize a
Mahout-specific data type like a vector or DRM, something registered with
Kryo.
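
For anyone who wants these settings in code rather than in the interpreter UI, here is a minimal sketch that just collects them in a plain Scala map. The key `spark.kryo.registrator` for the registrator class follows the interpreter properties listed earlier in the thread; `MahoutKryoProps` and `asLines` are names invented for this example.

```scala
object MahoutKryoProps {
  // Kryo-related Spark settings quoted in this thread.
  val props: Map[String, String] = Map(
    "spark.serializer"             -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator"       -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer"  -> "300m"
  )

  /** Render as sorted `key value` lines, the shape used for the interpreter properties above. */
  def asLines(p: Map[String, String]): Seq[String] =
    p.toSeq.sortBy(_._1).map { case (k, v) => s"$k $v" }

  // With Spark on the classpath, the same map could be applied when building a context,
  // e.g. new org.apache.spark.SparkConf().setAll(props); that line is left as a comment
  // so the sketch runs without Spark.
}
```

Whether the larger buffer and disabled reference tracking are needed in a notebook is something to test against a real job; the values are simply the ones quoted in the message.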
Trevor Grant
2016-05-18 15:02:43 UTC
Permalink
ah yes- I remember you pointing that out to me too.

I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)

I'm going to add that dependency to the readme.md now.

thanks,
tg

Trevor, this is very cool. I have not been able to look at it closely yet,
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however with
out further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
  enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a TSV
  string (with some sanity, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like,
  e.g. build profile vs. tutorial
  - I think the case for making a build profile is that Zeppelin is first
    and foremost a data science tool for non-technical users.
  - If we go that route I'll need some more support finding out what is the
    absolute minimum 'bare-bones' Mahout we can include, e.g. does the user
    have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> paragraph showing how to do the same
  thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) Do Mahout things as you do
3) Export the table to a TSV string, which is passed to a resource pool
- This could be done to disk if you didn't have Zeppelin
4) Read the TSV from the resource pool (or disk if you didn't have
Zeppelin) in R (Python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
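To make step 3 a bit more concrete, here is a minimal sketch of what the "matrix to TSV string, with some sanity" sugar could look like. It is not Mahout code: `TsvSketch`, `toTsv` and `maxRows` are made-up names, and the rows are plain Scala sequences (with Mahout you would presumably collect a small in-core matrix first).

```scala
// Hypothetical helper, not part of Mahout or Zeppelin.
object TsvSketch {
  /** Render a header plus at most `maxRows` evenly spaced rows as a TSV string. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]], maxRows: Int = 1000): String = {
    require(maxRows > 0, "maxRows must be positive")
    // "Some sanity": ship a sample rather than every row of a large matrix.
    val step =
      if (rows.size <= maxRows) 1
      else math.ceil(rows.size.toDouble / maxRows).toInt
    val sampled = rows.zipWithIndex.collect { case (r, i) if i % step == 0 => r }.take(maxRows)
    // First line is the header, then one tab-separated line per sampled row.
    (header.mkString("\t") +: sampled.map(_.mkString("\t"))).mkString("\n")
  }
}
```

The resulting string is what would be handed to the resource pool (or written to disk) for the R or Python paragraph to read.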
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
Mahout-specific data type like a vector or DRM, something registered with
Kryo.
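For reference, the settings above can be kept in one place as plain Scala. This is only a sketch: `MahoutKryoProps` and `asSubmitArgs` are made-up names, the registrator key is the `spark.kryo.registrator` property listed further down the thread, and the 300m buffer is simply the value quoted above, not a recommendation.

```scala
// Hypothetical holder for the Kryo-related Spark properties quoted in this thread.
object MahoutKryoProps {
  val props: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Render the properties as `--conf key=value` arguments, sorted for stable output. */
  def asSubmitArgs(p: Map[String, String] = props): Seq[String] =
    p.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

In Zeppelin the same key/value pairs would be typed into the Spark interpreter's property table instead; with Spark on the classpath they could be applied to a `SparkConf` before the context is created.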
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by:
- Adding properties to the Spark interpreter
- Adding dependency jars to the Spark interpreter
- Importing in a Spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties; please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be:
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) Use Zeppelin's resource buffers to pass the object
3) Collect the object in an R paragraph, convert it to a dataframe, then
map using ggplot
Once I have a working prototype I will work on adding some syntactic sugar to
prepare the matrix from the Scala side and pass it to Zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
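A rough sketch of that hand-off, with the pool reduced to a keyed string store. Everything here is a stand-in: `ResourcePool`, `InMemoryPool` and `Handoff` are hypothetical names, Zeppelin's own resource pool API is not used, and the consumer side is written in Scala only to show the shape of what an R function would do before building a dataframe.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a notebook resource pool (not Zeppelin's API).
trait ResourcePool {
  def put(name: String, value: String): Unit
  def get(name: String): Option[String]
}

final class InMemoryPool extends ResourcePool {
  private val store = mutable.Map.empty[String, String]
  def put(name: String, value: String): Unit = store.update(name, value)
  def get(name: String): Option[String] = store.get(name)
}

object Handoff {
  // Producer side: engine-specific code (Spark, Flink, ...) only has to
  // produce a TSV string and publish it under a name.
  def publish(pool: ResourcePool, name: String, tsv: String): Unit =
    pool.put(name, tsv)

  // Consumer side: split the TSV back into a header and numeric rows.
  def consume(pool: ResourcePool, name: String): Option[(Seq[String], Seq[Seq[Double]])] =
    pool.get(name).map { tsv =>
      val lines = tsv.split("\n").toSeq
      val header = lines.head.split("\t").toSeq
      val rows = lines.tail.map(_.split("\t").toSeq.map(_.toDouble))
      (header, rows)
    }
}
```

Because the producer only ever publishes a string under a name, the plotting side does not need to know which engine produced it, which is the point of going through the pool.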
Once it's in a dataframe in R, go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
Python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post, but it's almost a textbook entry tutorial on using imports in
Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make Zeppelin recognize matrices as tables and expose all of the
built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point, however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) as the implicit Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots
are at this point really just a POC, but at some point we may want
to integrate some data transformation features into the Mahout plots
classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
fwiw ggplot2 is pretty darn advanced :) I am a bit skeptical Smile would
have something that ggplot2 would not; the other way around is much more
expected by me :)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all-around very nice
venue then.
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to Mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
Most of the hard work was done by Dmitriy; I've just reworked it a
couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are Swing-based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is Swing-based but can output PNGs, maybe other image
formats (Andy?).
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy, e.g. why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap the existing SparkContext (sc) as the implicit Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our Mahout Spark shell, so let me know if you have any
questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for Spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hang out without having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-18 18:44:31 UTC
Permalink
Ah thank you.

Fixing now.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2:
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
  enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a TSV
  string (with some sanity, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like,
  e.g. build profile vs. tutorial
  - I think the case for making a build profile is that Zeppelin is first
    and foremost a data science tool for non-technical users.
  - If we go that route I'll need some more support finding out what is the
    absolute minimum 'bare-bones' Mahout we can include, e.g. does the user
    have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> paragraph showing how to do the same
  thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) Do Mahout things as you do
3) Export the table to a TSV string, which is passed to a resource pool
- This could be done to disk if you didn't have Zeppelin
4) Read the TSV from the resource pool (or disk if you didn't have
Zeppelin) in R (Python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
Mahout-specific data type like a vector or DRM, something registered with
Kryo.
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by:
- Adding properties to the Spark interpreter
- Adding dependency jars to the Spark interpreter
- Importing in a Spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties; please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be:
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) Use Zeppelin's resource buffers to pass the object
3) Collect the object in an R paragraph, convert it to a dataframe, then
map using ggplot
Once I have a working prototype I will work on adding some syntactic sugar to
prepare the matrix from the Scala side and pass it to Zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a dataframe in R, go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
Python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post, but it's almost a textbook entry tutorial on using imports in
Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make Zeppelin recognize matrices as tables and expose all of the
built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point, however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) as the implicit Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots
are at this point really just a POC, but at some point we may want
to integrate some data transformation features into the Mahout plots
classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
fwiw ggplot2 is pretty darn advanced :) I am a bit skeptical Smile would
have something that ggplot2 would not; the other way around is much more
expected by me :)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all-around very nice
venue then.
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to Mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new
visualizations in Mahout. I’m here because I’m fairly familiar with
AngularJS if that’s the way to go. Smile is swing based but can output
pngs, maybe other image formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration
is super easy from a user perspective, almost too easy- eg why not let
the user add it themselves... Add the appropriate maven artifacts,
restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap the existing SparkContext (sc) in a Mahout distributed context.
// The binding name `sdc` is illustrative.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more
'visualization friendly'? I could pass the results to Angular or R just
to show off how to do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on
how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have
any questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R
(through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms
than a roll-your-own math/algorithm tool. The major benefit is that
during experimentation and later in production the code is by nature
scalable on Spark and Flink. Most of the Mahout DSL is R-like and
supports tensor math but we are now looking at streaming online algo
support too.
In any case you probably know we have a Mahout version of the Spark
Shell, which has been integrated with an old version of Zeppelin (code
is lost). Recently Andy has experimented with some very nice
visualizations of ML data (not just analytics data). We as a project are
interested in Zeppelin integration of our shell and graphics. From what
I understand the graphics extension mechanism of Zeppelin is based on
AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout
PMC).
As I mentioned in my talk, we are actively looking at Zeppelin
integration with Mahout (primarily for spark) and would appreciate your
help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org -
that's where we all hangout and not having to worry about avoiding
naughty words.
Looking forward to working with you
Suneel
Shannon Quinn
2016-05-20 15:13:24 UTC
Permalink
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout +
Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and
only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is
the absolute minimum 'bare-bones' mahout we can include, e.g. does the
user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the
same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
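For step 3 and the "syntactic sugar" item above, here is a minimal sketch of what a matrix-to-TSV helper with sampling might look like. It is not Mahout API: `toTsv`, its parameters, and the use of a plain `Array[Array[Double]]` as a stand-in for a collected in-core matrix are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical helper (not part of Mahout): render an in-core matrix as a
// TSV string, sampling rows first if the matrix is larger than maxRows.
// A plain Array[Array[Double]] stands in for a collected Mahout matrix.
def toTsv(rows: Array[Array[Double]],
          header: Seq[String],
          maxRows: Int = 1000,
          seed: Long = 42L): String = {
  // Keep everything when small; otherwise take a random sample of rows.
  val sampled: Seq[Array[Double]] =
    if (rows.length <= maxRows) rows.toSeq
    else new Random(seed).shuffle(rows.toSeq).take(maxRows)
  // One tab-separated line per row, preceded by the header line.
  val lines = sampled.map(_.mkString("\t"))
  (header.mkString("\t") +: lines).mkString("\n")
}

val m = Array(Array(1.0, 2.0), Array(3.0, 4.0))
val tsv = toTsv(m, Seq("x", "y"))
println(tsv)
```

In a Zeppelin paragraph the resulting string could then be handed to the resource pool (for example with `z.put`) or written to disk, and read back from R. The fixed seed only makes the sample repeatable between runs.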
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline
is a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What
and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
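As a rough sketch, the Kryo properties quoted above can be kept in one place and rendered as whitespace-separated "key value" lines, the form used by spark-defaults.conf and by the interpreter property list later in this thread. The names `mahoutKryoProps` and `toSparkDefaults` are made up for this example. The values are the ones quoted in this thread, not recommendations.

```scala
// The Kryo-related Spark properties quoted in this thread.
val mahoutKryoProps: Seq[(String, String)] = Seq(
  "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
  "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
  "spark.kryo.referenceTracking" -> "false",
  "spark.kryoserializer.buffer" -> "300m"
)

// Hypothetical helper: one "key value" line per property.
def toSparkDefaults(props: Seq[(String, String)]): String =
  props.map { case (k, v) => s"$k $v" }.mkString("\n")

println(toSparkDefaults(mahoutKryoProps))
```

The same pairs could instead be entered one by one as properties of the Zeppelin Spark interpreter.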
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking
things were 'working' before because I wasn't actually integrating.
Lower I will outline the imports/properties, please look over and tell
me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can
easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then
map using ggplot
Once I have a working prototype I will add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using
resource pools so the same functionality can be reused in Flink) and an
R library containing some functions which will pull the data out of the
resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you
like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin
source code, and isn't very intrusive or difficult to set up. I'll make
a blog post but it's almost a textbook entry tutorial on using imports
in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin
site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get
a little more hairy in that case, but we could make an optional build
profile that would make zeppelin recognize matrices as tables and expose
all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would
be somewhat tedious), you're going to end up with a lot of examples
where you create a table in Mahout/Spark, pass it to AngularJS, then
some AngularJS code charts it for you. At that point however, you're
doing just as much work, if not more, than it would be to simply pass to
R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context.
// The binding name `sdc` is illustrative.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers
or serializer factories IIRC. Also there is the Spark conf for
allocating memory for the Kryo buffer. Look at the code in the mc
creation code in the Spark package helpers. All can be done in straight
Spark and passed in to create the mc when needed. Again from old weak
brain cells but I think that is part of what makes the Mahout shell
different than the Spark shell plus imports, it auto-creates the mc
instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
at this point are really just a POC, but at some point we may want to
integrate some data transformation features into the mahout plots
classes so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
Pat Ferrel
2016-05-20 15:22:46 UTC
Permalink
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.


On May 20, 2016, at 8:13 AM, Shannon Quinn <***@gatech.edu> wrote:

Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
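Step 3 of the pipeline above (export a table to a tsv string) can be sketched in plain Scala. This is only an illustration under assumptions: the object name `TsvSketch`, the `toTsv` helper, and the use of nested `Seq[Double]` in place of a Mahout matrix are all hypothetical and not part of Mahout or Zeppelin.

```scala
object TsvSketch {
  /** Render rows of doubles as a TSV string: one header line, then one line per row.
    * Hypothetical helper; a real version would read from a (sampled) Mahout matrix. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    // Prepend the tab-joined header to the tab-joined data rows.
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin paragraph the resulting string would then be handed to the resource pool (or written to disk), and the R or Python paragraph would parse it back into a data frame.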
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
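For anyone setting these outside Zeppelin's interpreter settings, the properties above can be kept as plain key/value pairs and rendered as `--conf key=value` arguments for spark-submit. A minimal sketch, assuming the registrator key is `spark.kryo.registrator` (as listed in the property dump further down the thread); the object and method names are hypothetical.

```scala
object KryoConfSketch {
  // Property names and values as quoted in this thread.
  val kryoProps: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Render each property as a `--conf` flag followed by `key=value`. */
  def toConfArgs(props: Seq[(String, String)]): Seq[String] =
    props.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

The same pairs could equally be typed into the Spark interpreter's property table in Zeppelin; the list form just keeps them in one place.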
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will work to add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up, I'll make a blog
post but it's almost a textbook entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext `sc` as a Mahout distributed context.
// Assumption: the leading `implicit val sdc:` was lost in the quoted text and is restored here.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
are at this point really just a POC, but at some point we may want
to integrate some data transformation features into the mahout plots
classes so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext `sc` as a Mahout distributed context.
// Assumption: the leading `implicit val sdc:` was lost in the quoted text and is restored here.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hang out without having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-20 16:54:10 UTC
Permalink
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to fit
in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
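
As a rough illustration of the sampling idea, here is a sketch in plain Scala that keeps a fixed number of rows chosen at random. It is not the code from the notebook: `SampleSketch` and `sampleRows` are hypothetical names, and it assumes the rows already fit in memory, which a real distributed matrix would not.

```scala
import scala.util.Random

object SampleSketch {
  /** Keep at most `n` rows chosen uniformly at random, without replacement.
    * A fixed `seed` makes the sample reproducible between runs. */
  def sampleRows[A](rows: Seq[A], n: Int, seed: Long): Seq[A] =
    new Random(seed).shuffle(rows).take(n)
}
```

Shuffling everything is fine for a small in-core matrix; for a distributed one the sampling would have to happen on the cluster side before anything is collected.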

So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.

For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
***@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.

I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.

...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
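
To make the "condense into small plottable information" point concrete, here is a sketch of an equal-width histogram in plain Scala. It is an assumption-laden illustration, not a Mahout API: `HistogramSketch` and `histogram` are hypothetical names, and the input is an in-memory `Seq[Double]` standing in for a column of a much larger matrix.

```scala
object HistogramSketch {
  /** Count values into `bins` equal-width buckets spanning [min, max]. */
  def histogram(values: Seq[Double], bins: Int): Array[Int] = {
    require(bins > 0 && values.nonEmpty, "need at least one bin and one value")
    val lo = values.min
    val hi = values.max
    val width = (hi - lo) / bins
    val counts = new Array[Int](bins)
    values.foreach { v =>
      // The maximum value would land at index `bins`, so clamp it into the last bucket.
      // If all values are equal (width 0), everything goes into the first bucket.
      val idx = if (width == 0.0) 0 else math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    counts
  }
}
```

However many points go in, only `bins` counts come out, which is the size of thing that can reasonably be turned into a tsv string and plotted.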

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will work to add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up, I'll make a blog
post but it's almost a textbook entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Pat Ferrel
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Assumed reconstruction: the declaration was lost in quoting; `sdc` is an illustrative name.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the mc creation code in the Spark
package helpers. All can be done in straight Spark and passed in to create
the mc when needed. Again from old weak brain cells, but I think that is
part of what makes the Mahout shell different from the Spark shell plus
imports: it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots
are at this point really just a POC, but at some point we may want to
integrate some data transformation features into the Mahout plots classes,
so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical Smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
Mahout/Smile plots and ggplot), since in the future we're going to be
adding more visualization features, rather than simple scatter plots etc.,
that may not be covered by ggplot.
That's why we were thinking about using Angular and the PNGs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to Mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are Swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is Swing based but can output PNGs, maybe other image
formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Assumed reconstruction: the declaration was lost in quoting; `sdc` is an illustrative name.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our Mahout Spark shell, so let me know if you have any
questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for Spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hang out and not having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Pat Ferrel
2016-05-20 17:21:06 UTC
Permalink
Agreed.

BTW I don’t want to stall progress but being the most ignorant of plot libs, I’ll ask if we should consider python and matplotlib. In another project we use python because of the RDD support on Spark though the visualizations are extremely limited in our case. If we can pass an RDD to pyspark it would allow custom reductions in python before plotting, even though we will support many natively in Mahout. I’m guessing that this would cross a context boundary and require a write to disk?

So 2 questions:
1) what does the inter language support look like with Spark python vs SparkR, can we transfer RDDs?
2) are the plot libs significantly different?

On May 20, 2016, at 9:54 AM, Trevor Grant <***@gmail.com> wrote:

Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to fit
in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
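
A minimal sketch of that kind of row sampling, in plain Scala on an in-memory stand-in rather than a distributed matrix. `RowSampler` and `sampleRows` are hypothetical names for illustration, not Mahout APIs; on a real DRM the sampling would have to happen on the cluster before anything is collected.

```scala
import scala.util.Random

object RowSampler {
  /** Returns up to `n` rows chosen uniformly at random, without replacement,
    * keeping the sampled rows in their original order. Hypothetical helper. */
  def sampleRows(rows: IndexedSeq[Array[Double]], n: Int, seed: Long): IndexedSeq[Array[Double]] = {
    require(n >= 0, "n must be non-negative")
    val rng = new Random(seed)
    // Shuffle the row indices, keep the first n, then restore row order.
    rng.shuffle(rows.indices.toVector).take(n).sorted.map(rows)
  }
}
```

A fixed seed keeps the plotted subset reproducible between notebook runs.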

So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.

For the Zeppelin plotting thing, we need to have a function that will spit
out a TSV-like string of the data we want plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
***@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
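
As a rough illustration of the "turn something into a tsv string" step, here is a plain-Scala sketch that works on an already collected, in-memory set of rows. `TsvExport` and `toTsv` are hypothetical names, not existing Mahout methods.

```scala
object TsvExport {
  /** Renders rows as tab-separated text with one header line.
    * Hypothetical helper; assumes the rows already fit in driver memory. */
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
    require(rows.forall(_.length == header.length), "every row must match the header width")
    // One line for the header, then one line per row.
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin paragraph, printing a string that starts with `%table` followed by text in this layout should be picked up by the built-in table display, which is one way the result could be charted without leaving the notebook.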

I have some general ideas on possible approaches to making an honest-Mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.

...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
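
To make the density-grid idea concrete, a small plain-Scala sketch that condenses 2-D points into per-cell counts. `DensityGrid` and `counts` are hypothetical names, and this runs on an in-memory sequence; a distributed version would compute the same counts per partition and add the grids together.

```scala
object DensityGrid {
  /** Counts points per cell of a bins x bins grid covering
    * [xMin, xMax] x [yMin, yMax]. Result is indexed as grid(row)(col),
    * with rows following y and columns following x. Points outside the
    * range are clamped into the nearest edge cell. Hypothetical helper. */
  def counts(points: Seq[(Double, Double)], bins: Int,
             xMin: Double, xMax: Double, yMin: Double, yMax: Double): Array[Array[Int]] = {
    require(bins > 0 && xMax > xMin && yMax > yMin, "need a positive bin count and non-empty ranges")
    val grid = Array.ofDim[Int](bins, bins)
    // Map a coordinate to its bin index, clamping to [0, bins - 1].
    def cell(v: Double, lo: Double, hi: Double): Int =
      math.min(bins - 1, math.max(0, ((v - lo) / (hi - lo) * bins).toInt))
    for ((x, y) <- points) grid(cell(y, yMin, yMax))(cell(x, xMin, xMax)) += 1
    grid
  }
}
```

However many points go in, the output is only `bins * bins` numbers, which is what makes it plottable.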

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with SparkR
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
  enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a TSV
  string (with some sanity, e.g. a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
  e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first
    and foremost a data science tool for non-technical users.
  - If we go that route I'll need some more support finding out what is the
    absolute minimum 'bare-bones' Mahout we can include, e.g. does the user
    have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> paragraph showing how to do the same
  thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
  dependencies, and importing relevant packages
2) Do Mahout things as you do
3) Export table to TSV string, which is passed to a resource pool
- This could be done to a disk if you didn't have Zeppelin
4) Read the TSV from the resource pool (or disk if you didn't have
Zeppelin) in R (Python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or Python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or Python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
Mahout-specific data type like a vector or DRM, something registered with
Kryo.
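
For reference, the settings quoted above can be kept together as plain key/value pairs. This is only a sketch: `MahoutKryoSettings` and `render` are hypothetical names, and the values are the ones from this thread rather than recommended defaults.

```scala
object MahoutKryoSettings {
  /** Spark properties mentioned in this thread, as key/value pairs. */
  val settings: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Renders the settings as "key value" lines, sorted by key,
    * which is the shape used when listing interpreter properties. */
  def render: String =
    settings.toSeq.sorted.map { case (k, v) => k + " " + v }.mkString("\n")
}
```

With Spark on the classpath, the same map could presumably be handed to a configuration object with `new SparkConf().setAll(MahoutKryoSettings.settings)` before creating the context.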
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark interpreter
- Adding dependency jars to the Spark interpreter
- Importing in a Spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties; please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) Use Zeppelin's resource buffers to pass the object
3) Collect the object in an R paragraph, convert it to a dataframe, then map
using ggplot
Once I have a working prototype I will add some syntactic sugar to
prepare the matrix from the Scala side and pass to Zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
Python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post, but it's almost a textbook entry tutorial on using imports in
Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make Zeppelin recognize matrices as tables and expose all of the
built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point, however, you're doing just as much
work, if not more, as it would take to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...). If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Assumed reconstruction: the declaration was lost in quoting; `sdc` is an illustrative name.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering
serializers
Post by Pat Ferrel
or
Post by Pat Ferrel
serializer factories IIRC. Also there is the Spark conf for
allocating
Post by Pat Ferrel
Post by Pat Ferrel
memory for the Kryo buffer. Look at the code in the mc creation code
in
Post by Pat Ferrel
the
Post by Pat Ferrel
Spark package helpers. All can be done in straight Spark and passed
in
to
Post by Pat Ferrel
Post by Pat Ferrel
create the mc when needed. Again from old weak brain cells but I
think
Post by Pat Ferrel
that
Post by Pat Ferrel
is part of what makes the Mahout shell different than teh Spark shell
plus
Post by Pat Ferrel
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority, The mahout
plots
Post by Pat Ferrel
Post by Pat Ferrel
are at this point are really just a POC, but at some point we may be
want
Post by Pat Ferrel
Post by Pat Ferrel
to integrate some data transformation features into the mahout plots
classes so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably
ggplot).
Post by Pat Ferrel
Post by Pat Ferrel
I'll do a quick tutorial. Then I'll reopen discussion on that
Zeppelin
Post by Pat Ferrel
Post by Pat Ferrel
issue about weather we want to go ahead and add another interpreter.
Souds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or
start a
Post by Pat Ferrel
Post by Pat Ferrel
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
[https://avatars3.githubusercontent.com/u/5852441?v=3&s=400]<
https://github.com/rawkintrevo>
rawkintrevo (Trevor Grant) · GitHub<https://github.com/rawkintrevo>
github.com
rawkintrevo has 12 repositories written in Python, Batchfile, and R.
Follow their code on GitHub.
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile
would
Post by Pat Ferrel
Post by Pat Ferrel
have something that ggplot2 would not, the other way around is much
more
Post by Pat Ferrel
Post by Pat Ferrel
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without
major
Post by Pat Ferrel
Post by Pat Ferrel
limitations, it sounds like Zeppelin should be an all around very
nice
Post by Pat Ferrel
Post by Pat Ferrel
venue then.
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both
(native
Post by Pat Ferrel
Post by Pat Ferrel
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots
etc
Post by Pat Ferrel
Post by Pat Ferrel
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what youre saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some
data,
Post by Pat Ferrel
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that
Zeppelin
Post by Pat Ferrel
Post by Pat Ferrel
issue about weather we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage zeppelins
visualization
Post by Pat Ferrel
Post by Pat Ferrel
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
Sorry- to be a little more clear, Part of what we're trying to is to
get
Post by Pat Ferrel
Post by Pat Ferrel
the new plotting features integrated with Zeppelin. We plan on adding
more
Post by Pat Ferrel
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy[??] , I've just reworked
it a
Post by Pat Ferrel
Post by Pat Ferrel
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface
plots
Post by Pat Ferrel
Post by Pat Ferrel
are not working properly. The plots are swing based and can
currently
be
Post by Pat Ferrel
Post by Pat Ferrel
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Post by Pat Ferrel
Post by Pat Ferrel
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I'm here because I'm fairly familiar with AngularJS if that's the way to go. Smile is Swing based but can output pngs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- eg why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The binding's name was lost in the archive; `implicit val sdc` is assumed here.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question, is this even worth building a full interpreter for or just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our Mahout Spark shell, so let me know if you have any questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things." -Virgil
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we'd like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for Spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org - that's where we all hang out without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-20 19:18:41 UTC
Permalink
Hey Pat,

If you spit out a TSV, you can import it into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact, you could import the TSV into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).

https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u

In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same Spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).

So in Mahout can you "save" a matrix as an RDD? e.g. something like

val myRDD = myDRM.asRDD()

And would 'myRDD' then exist in the spark context?


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
Dmitriy really nailed it on the head in his reply to the post, which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is that the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin plotting thing, we need to have a function that will spit out a TSV-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
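As a rough illustration of those two helpers (sampling a matrix down, then rendering it as a TSV string), here is a minimal sketch in plain Scala. It deliberately works on in-memory rows rather than a Mahout DRM, and the names `MatrixTsv`, `sampleRows`, and `toTsv` are hypothetical; nothing here is Mahout's actual API.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only (not part of Mahout).
object MatrixTsv {
  /** Keep at most `maxRows` rows, chosen at random without replacement. */
  def sampleRows(rows: Seq[Array[Double]], maxRows: Int, seed: Long): Seq[Array[Double]] =
    if (rows.size <= maxRows) rows
    else new Random(seed).shuffle(rows).take(maxRows)

  /** Render a header line followed by one tab-separated line per row. */
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a real session the rows would presumably come from a small in-core sample of the distributed matrix; that step is left out because it depends on the Mahout API.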
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
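To make the "density grid" idea concrete, here is a small sketch in plain Scala that condenses any number of (x, y) points into a fixed-size table of counts. The object name `DensityGrid` and its signature are made up for illustration; on a distributed dataset the same counting could presumably run per partition and the grids be summed, but that part is not shown.

```scala
// Hypothetical helper, for illustration only (not part of Mahout).
object DensityGrid {
  /** Count points per cell of a `bins` x `bins` grid over the given bounds. */
  def grid2d(points: Iterator[(Double, Double)],
             xMin: Double, xMax: Double,
             yMin: Double, yMax: Double,
             bins: Int): Array[Array[Long]] = {
    val counts = Array.ofDim[Long](bins, bins)
    // Map a coordinate to a bin index; values on or beyond the bounds are clamped to the edge bins.
    def bin(v: Double, lo: Double, hi: Double): Int =
      math.min(bins - 1, math.max(0, ((v - lo) / (hi - lo) * bins).toInt))
    points.foreach { case (x, y) =>
      counts(bin(y, yMin, yMax))(bin(x, xMin, xMax)) += 1L
    }
    counts
  }
}
```

The output size depends only on `bins`, not on the number of input points, which is what makes it plottable regardless of how large the source data is.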
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we'll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below; however, without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
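The hand-off in steps 3 and 4 comes down to a string contract: one header line, then tab-separated numeric rows. The sketch below shows the consuming side of that contract in plain Scala. In the thread this step is done in R; Scala is used here only to keep the examples in one language. The `pool` map is a stand-in for Zeppelin's resource pool (whose real API is not shown), and `TsvRoundTrip` and `parseTsv` are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical helper, for illustration only.
object TsvRoundTrip {
  // Stand-in for Zeppelin's resource pool: a plain in-memory map, not the real API.
  val pool = mutable.Map.empty[String, String]

  /** Parse a TSV string with one header line into (column names, numeric rows).
    * Assumes at least a header line and that every data cell parses as a Double. */
  def parseTsv(tsv: String): (Seq[String], Seq[Seq[Double]]) = {
    val lines = tsv.split("\n").toSeq.filter(_.nonEmpty)
    val header = lines.head.split("\t").toSeq
    val rows = lines.tail.map(_.split("\t").toSeq.map(_.toDouble))
    (header, rows)
  }
}
```

A producer paragraph would store the string under an agreed key and a consumer paragraph would read it back and call `parseTsv` (or the R / pandas equivalent) before plotting.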
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
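For anyone wanting to reuse the Kryo settings quoted above outside Zeppelin, here is a small plain-Scala sketch that holds them as data and renders them two ways. The property names and values are the ones quoted in this thread (the registrator key name comes from the property listing later in the thread, and the 300m buffer is that author's choice, not a recommendation); `MahoutKryoProps` and its methods are hypothetical names.

```scala
// Hypothetical helper, for illustration only.
object MahoutKryoProps {
  // Property names and values as quoted in the thread.
  val props: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** One "key value" line per property, the layout used by spark-defaults.conf. */
  def asDefaultsConf: String = props.map { case (k, v) => s"$k $v" }.mkString("\n")

  /** The same properties as repeated `--conf key=value` arguments. */
  def asSubmitArgs: Seq[String] = props.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

In Zeppelin itself the same key/value pairs would be entered as properties on the Spark interpreter settings page rather than generated this way.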
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post, but it's almost a textbook entry tutorial on using imports in Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make Zeppelin recognize matrices as tables and expose all of the built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The binding's name was lost in the archive; `implicit val sdc` is assumed here.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be
a
Post by Pat Ferrel
Post by Pat Ferrel
library internal to mahout or should we leverage zeppelins
visualization
Post by Pat Ferrel
Post by Pat Ferrel
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
Sorry- to be a little more clear, Part of what we're trying to is
to
get
Post by Pat Ferrel
Post by Pat Ferrel
the new plotting features integrated with Zeppelin. We plan on
adding
Post by Pat Ferrel
more
Post by Pat Ferrel
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy[??] , I've just reworked
it a
Post by Pat Ferrel
Post by Pat Ferrel
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface
plots
Post by Pat Ferrel
Post by Pat Ferrel
are not working properly. The plots are swing based and can
currently
be
Post by Pat Ferrel
Post by Pat Ferrel
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Post by Pat Ferrel
Post by Pat Ferrel
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a
visualization
Post by Pat Ferrel
Post by Pat Ferrel
lib since it is pretty rich in ML support. We are looking at
integrating
Post by Pat Ferrel
Post by Pat Ferrel
some of that with Zeppelin then adding code to feed the new
visualizations
Post by Pat Ferrel
in Mahout. I’m here because I’m fairly familiar with AngularJS if
that’s
Post by Pat Ferrel
Post by Pat Ferrel
the way to go. Smile is swing based but can output pngs, maybe other
image
Post by Pat Ferrel
formats—Andy?
BTW Dmitriy is still very involved but has rouble getting permission
to
Post by Pat Ferrel
Post by Pat Ferrel
donate code.
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Pat Ferrel
Post by Pat Ferrel
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this
integration
Post by Pat Ferrel
is
Post by Pat Ferrel
super easy from a user perspective, almost too easy- eg why not let
the
Post by Pat Ferrel
Post by Pat Ferrel
user add it themselves... Add the appropriate maven artifacts,
restart
Post by Pat Ferrel
the
Post by Pat Ferrel
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
org.apache.mahout.sparkbindings.SparkDistributedContext
Post by Pat Ferrel
Post by Pat Ferrel
= sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpretter like %spark.mahout should be fairly straight forward.
Second question, do you have an example that would be more
'visualization
Post by Pat Ferrel
Post by Pat Ferrel
friendly'? I could pass the results to Angular or R just to show off
how
Post by Pat Ferrel
to
Post by Pat Ferrel
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples
on
how
Post by Pat Ferrel
Post by Pat Ferrel
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a
good
Post by Pat Ferrel
Post by Pat Ferrel
amount of work on our mahout spark shell .. so let me know if you
have
Post by Pat Ferrel
any
Post by Pat Ferrel
questions there about what we did there..
Thanks alot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat
Andrew
Post by Pat Ferrel
Post by Pat Ferrel
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old
lost
Post by Pat Ferrel
Post by Pat Ferrel
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly but it given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of
the
Post by Pat Ferrel
Post by Pat Ferrel
Zeppelin visualizations, anything in AngularJS, or anything in R
(through
Post by Pat Ferrel
Post by Pat Ferrel
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hang out without having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Suneel Marthi
2016-05-20 21:57:42 UTC
Permalink
Post by Trevor Grant
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
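As a rough illustration of the Python side of that hand-off, here is a
minimal sketch that parses a TSV string into plain columns using only the
standard library. The function name `tsv_to_columns` and the sample data are
made up for illustration, and fetching the string from Zeppelin's resource
pool is not shown. With pandas installed,
`pandas.read_csv(io.StringIO(tsv_text), sep="\t")` should give a DataFrame
directly.

```python
import csv
import io


def tsv_to_columns(tsv_text):
    """Parse a TSV string with a header row into {column name: [floats]}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        if not row:
            continue  # skip blank lines
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns


# Hypothetical sample standing in for a string pulled from the resource pool.
sample_tsv = "x\ty\n1.0\t2.0\n3.0\t4.5\n"
columns = tsv_to_columns(sample_tsv)
print(columns)
```

The resulting columns can be handed to whatever plotting call you prefer.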
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to
fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.
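To make the idea concrete, here is a minimal sketch of such a function in
plain Python. It is not Mahout code: `matrix_to_tsv`, `max_rows` and `seed`
are hypothetical names, and the matrix is an ordinary in-memory list of rows
standing in for whatever a real implementation would sample from a
distributed matrix.

```python
import random


def matrix_to_tsv(rows, header, max_rows=1000, seed=None):
    """Return a TSV string of at most max_rows randomly sampled rows."""
    rng = random.Random(seed)
    if len(rows) > max_rows:
        # Sample without replacement so the plot stays a manageable size.
        rows = rng.sample(rows, max_rows)
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Toy 10 x 2 matrix; only 3 of its rows end up in the TSV.
matrix = [[float(i), float(i * i)] for i in range(10)]
tsv = matrix_to_tsv(matrix, ["x", "y"], max_rows=3, seed=42)
print(tsv)
```

Sampling before serializing keeps the string small enough to pass around,
which is the "sanity" part of the request.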
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
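A density grid is one way to read "condense into small plottable
information": count how many points fall into each cell of a fixed grid and
plot the counts instead of the points. The sketch below is plain Python with
hypothetical names (`density_grid`, `bins`), not anything from Mahout; a
distributed version would compute the same counts per partition and add
them up.

```python
def density_grid(points, x_range, y_range, bins):
    """Count points per cell of a bins x bins grid over the given ranges."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted window
        # Clamp so points exactly on the upper edge land in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        grid[row][col] += 1
    return grid


points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.5, 0.95)]
grid = density_grid(points, (0.0, 1.0), (0.0, 1.0), bins=2)
print(grid)
```

However many points go in, the output is always bins x bins numbers, so it
fits in memory and cannot overplot.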
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark
+ Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
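For the disk fallback in steps 3 and 4, the round trip can be sketched in
Python with just the standard library. This assumes a purely numeric table
with a header row; `write_tsv`, `read_tsv` and the file name are
hypothetical, and a temporary directory stands in for wherever the Scala
side would actually write.

```python
import csv
import os
import tempfile


def write_tsv(path, header, rows):
    """Step 3 (disk variant): write a header row and numeric rows as TSV."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv(path):
    """Step 4: read the TSV back as (header, rows of floats)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [[float(cell) for cell in row] for row in reader]
    return header, rows


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "matrix.tsv")
    write_tsv(path, ["x", "y"], [[1.0, 2.0], [3.0, 4.0]])
    header, rows = read_tsv(path)

print(header, rows)
```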
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
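For anyone who wants these as a `spark-defaults.conf` style listing (one
property per line, key then value), here is a small Python sketch that
renders them. The property names are the ones quoted in this thread; the
helper `to_spark_defaults` is hypothetical, and the `300m` buffer size is
just the value used above, not a recommendation.

```python
# Kryo-related Spark properties as quoted in this thread.
mahout_spark_properties = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator":
        "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
}


def to_spark_defaults(properties):
    """Render properties as whitespace-separated key/value lines."""
    width = max(len(key) for key in properties)
    return "\n".join(
        f"{key.ljust(width)}  {value}" for key, value in properties.items()
    )


conf_text = to_spark_defaults(mahout_spark_properties)
print(conf_text)
```

The same key/value pairs can be typed into the Spark interpreter settings
in Zeppelin instead of a file.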
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will work to add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post but it's almost a textbook entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need
to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
  sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
at this point are really just a POC, but at some point we may want to
integrate some data transformation features into the mahout plots classes
so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical Smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Post by Pat Ferrel
Post by Pat Ferrel
Post by Pat Ferrel
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Pat Ferrel
Post by Pat Ferrel
Post by Pat Ferrel
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext
  = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hang out without having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Dmitriy Lyubimov
2016-05-20 22:02:07 UTC
Permalink
no parentheses.

import org.apache.mahout.sparkbindings._
....
val myRdd = myDrm.rdd
Post by Trevor Grant
Post by Trevor Grant
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
Post by Trevor Grant
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is that the data is too big to fit in memory. If it's too big
to fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.
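To make the "sample, then emit a tsv-like string" idea concrete, here is a small Python sketch. It is not Mahout code: `sample_rows_to_tsv`, its parameters, and the toy matrix are all hypothetical, and it samples an in-memory list, whereas a real implementation would presumably sample on the cluster before collecting anything.

```python
import random


def sample_rows_to_tsv(rows, header, max_rows, seed=0):
    """Randomly keep at most max_rows rows and render them as a TSV string."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rows = list(rows)
    if len(rows) > max_rows:
        rows = rng.sample(rows, max_rows)
    lines = ["\t".join(header)]
    lines.extend("\t".join(str(value) for value in row) for row in rows)
    return "\n".join(lines)


# Toy stand-in for a matrix that is too large to plot point by point.
matrix = [[i, i * 2] for i in range(1000)]
tsv = sample_rows_to_tsv(matrix, ["x", "y"], max_rows=10)
```

The header line plus at most `max_rows` data lines is small enough to pass between paragraphs as a plain string.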
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
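One way to picture the "condense into small plottable information" point is a density grid: count points per cell and plot the counts rather than the points. The sketch below is plain single-machine Python with a hypothetical `density_grid` helper, offered only as an illustration of the idea, not as how Mahout does or would do it.

```python
def density_grid(points, x_range, y_range, bins):
    """Count (x, y) points per cell of a bins-by-bins grid over the given ranges."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted window
        # Clamp so that points exactly on the upper edge land in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        grid[row][col] += 1
    return grid


points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (1.0, 1.0), (5.0, 5.0)]
grid = density_grid(points, (0.0, 1.0), (0.0, 1.0), bins=2)
```

However many points go in, the output is always `bins * bins` numbers, which addresses both the storage and the overplotting concern.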
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark
+ Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"org.apache.spark.serializer.KryoSerializer",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector of drm, something registered with
Kryo.
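For reference, the Kryo settings mentioned in this thread can be collected into one mapping and rendered as `spark-defaults.conf`-style lines. This is only a sketch: the two keys whose names do not appear in the excerpt above (`spark.serializer`, `spark.kryo.registrator`) are taken from the interpreter properties quoted elsewhere in this thread, the `to_spark_defaults` helper is hypothetical, and the values are simply those quoted, not recommendations.

```python
# Keys and values as quoted in this thread; the pairing of the first two keys
# with their values follows the interpreter properties listed in another message.
kryo_settings = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
}


def to_spark_defaults(settings):
    """Render a settings dict as 'key value' lines, sorted by key."""
    return "\n".join(f"{key} {value}" for key, value in sorted(settings.items()))


conf_text = to_spark_defaults(kryo_settings)
```

The same key/value pairs could instead be entered one by one as Spark interpreter properties in Zeppelin.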
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Lower I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post but it's almost a textbook entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need
to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// make the Mahout distributed context implicitly available (val name `sdc` is assumed)
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
at this point are really just a POC, but at some point we may want
to integrate some data transformation features into the mahout plots
classes so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats—Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// make the Mahout distributed context implicitly available (val name `sdc` is assumed)
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions there about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hangout and not having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 22:36:27 UTC
Permalink
Trevor, my zeppelin source is at this version:

<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>

And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
Post by Suneel Marthi
Post by Trevor Grant
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD
to pyspark it would allow custom reductions in python before plotting,
even though we will support many natively in Mahout. I’m guessing that
this would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
Dmitriy really nailed it on the head in his reply to the post which
I'll rebroadcast below. In essence the whole reason you are
(theoretically) using Mahout is the data is too big to fit in memory.
If it's too big to fit in memory, well then it's probably too big to
plot each point (e.g. trillions of rows, you only have so many pixels).
For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will
spit out a tsv like string of the data we wanted plotted.
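A minimal sketch of the "sample, then spit out a tsv string" idea, in
plain Scala with no Mahout or Spark on the classpath. The object name
MatrixToTsv and both method names are hypothetical, and rows are plain
Array[Double] stand-ins for matrix rows, not the Mahout API.

```scala
import scala.util.Random

// Hypothetical helper; rows are plain arrays, not Mahout matrices.
object MatrixToTsv {
  /** Render a header line plus one tab-separated line per row. */
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
    val lines = rows.map(_.mkString("\t"))
    (header.mkString("\t") +: lines).mkString("\n")
  }

  /** Keep at most maxRows rows, picked at random without replacement. */
  def sampleRows(rows: IndexedSeq[Array[Double]],
                 maxRows: Int,
                 seed: Long): IndexedSeq[Array[Double]] =
    if (rows.length <= maxRows) rows
    else new Random(seed).shuffle(rows).take(maxRows)
}
```

Sampling before rendering keeps the string small enough to hand to a
plotting paragraph; the seed only makes the sample repeatable.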
I agree an honest Mahout interpreter in Zeppelin is probably worth
doing. There are a couple of ways to go about it. I opened up the
discussion on
mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an
honest-mahout interpreter but I want to play in the code and look at
the Flink-Mahout shell a bit before I try to organize my thoughts and
present them.
...(2) not sure what is the point of supporting distributed anything.
It is distributed presumably because it is hard to keep it in memory.
Therefore,
storage space and overplotting due to number of points. The idea is that
we have to work out algorithms that condense big data information into
small plottable information (like density grids, for example, or
histograms)....
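One way to picture "condense big data information into small plottable
information" is a fixed-width histogram: however many points go in,
only `bins` counts come out. This is a plain-Scala sketch, not Mahout
code; the object name Histogram and its signature are hypothetical. On
a distributed matrix the same counting could presumably be done per
partition and the small count arrays summed, but that part is an
assumption and is not shown.

```scala
// Hypothetical helper: reduce any number of values to `bins` counts.
object Histogram {
  /** Count values into `bins` equal-width buckets spanning [min, max]. */
  def counts(values: Iterator[Double],
             min: Double,
             max: Double,
             bins: Int): Array[Long] = {
    require(bins > 0 && max > min, "need at least one bin and a non-empty range")
    val out = new Array[Long](bins)
    val width = (max - min) / bins
    // Values outside [min, max] are skipped rather than clamped.
    for (v <- values if v >= min && v <= max) {
      // The right edge (v == max) is counted in the last bin.
      val i = math.min(((v - min) / width).toInt, bins - 1)
      out(i) += 1
    }
    out
  }
}
```

The output size depends only on `bins`, so it stays plottable no matter
how many rows were scanned.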
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Suneel Marthi
Post by Trevor Grant
Post by Pat Ferrel
Great job Trevor, we’ll need this detail to smooth out the sharp edges
and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned
is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in
getting Zeppelin to work right after I accidentally updated to the new
snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely
yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and
t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout +
Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting
point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and
only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a
tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels
like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is
the absolute minimum 'bare-bones' mahout we can include, e.g. does the
user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the
same thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a
Mahout interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline
is a little convoluted (so maybe no need for Angular integration). To
get graphics out of Mahout it would be nice to not require knowledge of
R and/or python. Knowing Mahout is already bad enough but I guess the
API from the Mahout side for plotting could be Scala syntactic sugar.
What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
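For reference, the four Kryo-related settings mentioned in this thread
can be gathered in one place. This is only a sketch in plain Scala:
the object name MahoutKryoSettings is hypothetical, the property keys
for the serializer and registrator are the ones listed elsewhere in
the thread, and the 300m buffer is simply the value quoted above, not
a recommendation.

```scala
// Hypothetical holder for the settings discussed in this thread.
object MahoutKryoSettings {
  val properties: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" ->
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m" // value quoted above
  )

  /** One "key value" line per property, sorted by key. */
  def asDefaultsConf: String =
    properties.toSeq.sortBy(_._1)
      .map { case (k, v) => s"$k $v" }
      .mkString("\n")
}
```

With Spark on the classpath the same pairs could be handed to
SparkConf.setAll, or entered one by one as interpreter properties in
Zeppelin.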
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking
things were 'working' before because I wasn't actually integrating.
Lower I will outline the imports/properties, please look over and tell
me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can
easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then
map using ggplot
Once I have a working prototype I will work to add some syntactic sugar
to prepare the matrix from the scala side and pass to zeppelin (using
resource pools so the same functionality can be reused in Flink) and an
R library containing some functions which will pull the data out of the
resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you
like.
Likewise, it should be possible to do the same thing with matplotlib
and python (
https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin
source code, and isn't very intrusive or difficult to set up. I'll make
a blog post but it's almost a textbook entry tutorial on using imports
in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin
site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get
a little more hairy in that case, but we could make an optional build
profile that would make zeppelin recognize matrices as tables and
expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would
be somewhat tedious), you're going to end up with a lot of examples
where you create a table in Mahout/Spark, pass it to AngularJS, then
some AngularJS code charts it for you. At that point however, you're
doing just as much work, if not more, than it would be to simply pass
to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
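// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context.
// The declaration prefix is missing from the archived message;
// "implicit val sdc" is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)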
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers
or serializer factories IIRC. Also there is the Spark conf for
allocating memory for the Kryo buffer. Look at the code in the mc
creation code in the Spark package helpers. All can be done in straight
Spark and passed in to create the mc when needed. Again from old weak
brain cells but I think that is part of what makes the Mahout shell
different than the Spark shell plus imports, it auto-creates the mc
instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout
plots are at this point really just a POC, but at some point we may
want to integrate some data transformation features into the mahout
plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much
more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without
major limitations, it sounds like Zeppelin should be an all around very
nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage Zeppelin's
visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to
get the new plotting features integrated with Zeppelin. We plan on
adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface
plots are not working properly. The plots are swing based and can
currently be
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a
visualization lib since it is pretty rich in ML support. We are looking
at integrating some of that with Zeppelin then adding code to feed the
new visualizations in Mahout. I’m here because I’m fairly familiar with
AngularJS if that’s the way to go. Smile is swing based but can output
pngs, maybe other image formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission
to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in
the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration
is super easy from a user perspective, almost too easy- eg why not let
the user add it themselves... Add the appropriate maven artifacts,
restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
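// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context.
// The declaration prefix is missing from the archived message;
// "implicit val sdc" is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)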
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more
'visualization friendly'? I could pass the results to Angular or R just
to show off how to do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on
how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have
any questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old
lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R
(through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a
reincarnation of the project in a new body, which is less a collection
of algorithms than a roll-your-own math/algorithm tool. The major
benefit is that during experimentation and later in production the code
is by nature scalable on Spark and Flink. Most of the Mahout DSL is
R-like and supports tensor math but we are now looking at streaming
online algo support too.
In any case you probably know we have a Mahout version of the Spark
Shell, which has been integrated with an old version of Zeppelin (code
is lost). Recently Andy has experimented with some very nice
visualizations of ML data (not just analytics data). We as a project
are interested in Zeppelin integration of our shell and graphics. From
what I understand the graphics extension mechanism of Zeppelin is based
on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org - that's where we all hang out, without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 22:41:32 UTC
Permalink
Oh, it might have been a browser cache issue; a couple of hard-refresh methods didn't help, but using another browser shows the import link.

On Fri, May 20, 2016 at 3:36 PM, Andrew Musselman <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
Post by Trevor Grant
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the resource pool in essentially the same way and use that plotting library if you prefer. In fact you could import the tsv into pandas and use all of the pandas plotting as well (though I think it is, for the most part, also matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and scala-spark all share the same spark context, so you can create RDDs in one language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot libs, I’ll ask if we should consider python and matplotlib. In another project we use python because of the RDD support on Spark though the visualizations are extremely limited in our case. If we can pass an RDD to pyspark it would allow custom reductions in python before plotting, even though we will support many natively in Mahout. I’m guessing that this would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
Dmitriy really nailed it on the head in his reply to the post which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
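The random sampling mentioned here can be sketched in plain Scala. This is only an illustration on an in-memory collection of rows, not Mahout's API: `sampleRows`, the row representation, and the seed are all assumptions made for the example.

```scala
import scala.util.Random

object SampleSketch {
  // Hypothetical helper (not part of Mahout): keep at most `maxRows` rows,
  // chosen uniformly at random without replacement.
  def sampleRows(rows: Seq[Array[Double]], maxRows: Int, seed: Long): Seq[Array[Double]] =
    if (rows.length <= maxRows) rows
    else new Random(seed).shuffle(rows).take(maxRows)

  def main(args: Array[String]): Unit = {
    // 10,000 made-up rows of (x, 2x) standing in for a collected matrix.
    val big = Seq.tabulate(10000)(i => Array(i.toDouble, i * 2.0))
    val small = sampleRows(big, maxRows = 500, seed = 42L)
    println(small.length) // prints 500
  }
}
```

For data that really does not fit in memory, the sampling would have to happen on the distributed side before anything is collected (for example with Spark's `RDD.sample`); the sketch only shows the idea of capping the number of plotted points.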
For the Zeppelin-Plotting thing, we need to have a function that will spit out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on
mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore,
storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
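As a rough illustration of condensing many points into a small plottable summary, here is a minimal equal-width histogram in plain Scala. It is a sketch under assumptions: `histogram` is a hypothetical helper, it runs on a local collection rather than a distributed matrix, and it is not how Mahout implements anything.

```scala
object HistogramSketch {
  // Hypothetical helper: count how many values fall into each of `bins`
  // equal-width buckets spanning [min, max] of the data.
  def histogram(values: Seq[Double], bins: Int): Array[Int] = {
    require(bins > 0 && values.nonEmpty, "need at least one bin and one value")
    val lo = values.min
    val hi = values.max
    val width = (hi - lo) / bins
    val counts = new Array[Int](bins)
    for (v <- values) {
      // The maximum value would otherwise index one past the end, so clamp it
      // into the last bucket; if all values are equal, everything goes in bucket 0.
      val idx = if (width == 0.0) 0 else math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val counts = histogram(Seq(0.0, 0.1, 0.2, 0.5, 0.9, 1.0), bins = 2)
    println(counts.mkString(", ")) // prints 3, 3
  }
}
```

However many input points there are, only `bins` numbers reach the plotting side, which is the point being made about overplotting. A density grid is the same idea in two dimensions.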
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly:
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
  - If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
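Steps 3 and 4 hinge on a TSV string that one side writes and the other side parses. A minimal sketch of that round trip in plain Scala follows; `toTsv` and `fromTsv` are hypothetical helpers, the matrix is represented as local rows of doubles, and the hand-off through Zeppelin's resource pool (or a file on disk) is deliberately left out.

```scala
object TsvSketch {
  // Hypothetical helper: render a header line plus one tab-separated line per row.
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String =
    (header.mkString("\t") +: rows.map(_.mkString("\t"))).mkString("\n")

  // Hypothetical helper: the inverse, as a reader on the plotting side might do it.
  def fromTsv(tsv: String): (Seq[String], Seq[Array[Double]]) = {
    val lines = tsv.split("\n").toSeq
    val header = lines.head.split("\t").toSeq
    val rows = lines.tail.map(_.split("\t").map(_.toDouble))
    (header, rows)
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: two columns, two rows.
    val tsv = toTsv(Seq("x", "y"), Seq(Array(1.0, 2.0), Array(3.0, 4.5)))
    println(tsv)
    val (header, rows) = fromTsv(tsv)
    println(s"${header.length} columns, ${rows.length} rows") // prints 2 columns, 2 rows
  }
}
```

A plain string is a reasonable interchange format here because every language in the notebook (R, Python, Scala) can parse it without sharing any classes, at the cost of re-parsing the numbers on the receiving side.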
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
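The same Kryo properties can also be set programmatically on a `SparkConf` instead of through interpreter settings. A minimal sketch, assuming Spark and the Mahout Spark bindings are on the classpath; the app name is a placeholder, and the property names and values are the ones quoted in this thread.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Build a SparkConf carrying the Kryo settings discussed in the thread.
  def mahoutKryoConf(): SparkConf =
    new SparkConf()
      .setAppName("mahout-kryo-sketch") // hypothetical app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
      .set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer", "300m")

  def main(args: Array[String]): Unit = {
    // Print the effective settings so they can be compared with the interpreter config.
    mahoutKryoConf().getAll.foreach { case (k, v) => println(s"$k=$v") }
  }
}
```

As noted above, a misconfigured registrator may not show up until a Mahout-specific type is actually serialized, so a conf that builds cleanly is not by itself proof that Kryo is set up correctly.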
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post, but it's almost a textbook entry tutorial on using imports in Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
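// wrap the existing SparkContext as an implicit Mahout distributed context (val name "sdc" assumed)
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)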
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are swing based and can currently be exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is swing based but can output pngs, maybe other image formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- eg why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// wrap the existing SparkContext as an implicit Mahout distributed context (val name "sdc" assumed)
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hangout and not having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 22:46:27 UTC
Permalink
Now this, definitely would help to clarify the instructions; let me know if
I can help.

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
java.lang.NoClassDefFoundError: org/apache/mahout/math/AbstractMatrix
at
org.apache.mahout.sparkbindings.SparkDistributedContext.<init>(SparkDistributedContext.scala:25)
at org.apache.mahout.sparkbindings.package$.sc2sdc(package.scala:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:66)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:78)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:88)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:90)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:92)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:94)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:96)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:100)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:106)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:108)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:110)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:112)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:114)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:118)
at $iwC$$iwC$$iwC.<init>(<console>:120)
at $iwC$$iwC.<init>(<console>:122)
at $iwC.<init>(<console>:124)
at <init>(<console>:126)
at .<init>(<console>:130)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:812)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:755)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:748)
at
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:331)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
org.apache.mahout.math.AbstractMatrix
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 64 more
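A NoClassDefFoundError for org/apache/mahout/math/AbstractMatrix generally means that the jar holding that class is not on the interpreter's classpath. Going by the jar list earlier in the thread, that is presumably the mahout-math jar. A minimal sketch for checking which local jars actually contain a given class; the helper name `jars_containing` and the ~/.m2 search path are assumptions for illustration, not part of Mahout or Zeppelin:

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, jar_paths):
    """Return the jars (as strings) that contain the given fully qualified class."""
    # A class a.b.C is stored inside a jar as the zip entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in jar_paths:
        path = Path(jar)
        if not path.is_file():
            continue  # skip paths that do not exist
        try:
            with zipfile.ZipFile(path) as zf:
                if entry in zf.namelist():
                    hits.append(str(path))
        except zipfile.BadZipFile:
            continue  # not a valid jar/zip archive
    return hits


if __name__ == "__main__":
    # Hypothetical location: a local Maven repository under the home directory
    candidates = Path.home().glob(".m2/repository/org/apache/mahout/**/*.jar")
    for hit in jars_containing("org.apache.mahout.math.AbstractMatrix", candidates):
        print(hit)
```

If no jar that is actually registered with the Spark interpreter shows up in the output, the missing dependency is the likely cause of the trace above.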

On Fri, May 20, 2016 at 3:41 PM, Andrew Musselman <
Post by Andrew Musselman
Oh might have been a browser cache issue; even after a couple hard refresh
methods using another browser has the import link.
On Fri, May 20, 2016 at 3:36 PM, Andrew Musselman <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to fit
in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
mean we
can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
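The two first steps named here (a sampling preprocess and a matrix-to-TSV conversion) can be sketched as follows. This is only an illustration on plain in-memory lists, not Mahout code; `sample_rows` and `to_tsv` are hypothetical names, and a real version would sample the distributed matrix before collecting it.

```python
import random


def sample_rows(rows, max_rows, seed=0):
    """Return at most max_rows rows, chosen uniformly at random, in original order."""
    rows = list(rows)
    if len(rows) <= max_rows:
        return rows
    rng = random.Random(seed)
    # Sample row indices without replacement, then restore the original order
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]


def to_tsv(rows, header=None):
    """Render rows of values as a tab-separated string, one row per line."""
    lines = []
    if header is not None:
        lines.append("\t".join(header))
    for row in rows:
        lines.append("\t".join(str(v) for v in row))
    return "\n".join(lines) + "\n"
```

A call such as `to_tsv(sample_rows(matrix, 1000), header=["x", "y"])` would then produce the string that gets handed to the plotting side.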
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
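As an illustration of the "density grid" idea, here is a minimal single-machine sketch that condenses an arbitrary number of (x, y) points into a fixed-size grid of counts. The function name and the equal-width binning are assumptions made for the example; a distributed version would presumably compute one grid per partition and add them together.

```python
def density_grid(points, x_range, y_range, bins):
    """Count (x, y) points per cell of a bins-by-bins grid over the given ranges."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted window
        # Scale to a cell index; the upper edge falls into the last cell
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        grid[row][col] += 1
    return grid
```

The output size depends only on `bins`, not on the number of input points, which is what makes it plottable regardless of how large the source matrix is.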
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Trevor Grant
Post by Pat Ferrel
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
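Step 4 of the pipeline above, the consumer side, can be sketched in Python with only the standard library. The TSV text is passed in directly here; in Zeppelin it would come from the resource pool, or from a file if the disk route is used. `read_tsv` is a hypothetical helper, and it assumes a header row followed by purely numeric columns.

```python
import csv
import io


def read_tsv(text):
    """Parse a TSV string with a header row into a dict of column name -> list of floats."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        if not row:
            continue  # skip blank lines
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns
```

The resulting columns can be handed to whichever plotting package is preferred, for example as the x and y arguments of a scatter plot.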
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
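For reference, the Kryo settings can be collected in one place and rendered as `--conf key=value` arguments. This is a sketch, not a tested Mahout setup: the property names for the serializer and registrator are taken from the Spark interpreter properties Trevor lists elsewhere in this thread, and `as_conf_flags` is a hypothetical helper. In Zeppelin the same pairs would be entered as interpreter properties rather than flags.

```python
KRYO_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
}


def as_conf_flags(settings):
    """Render settings as a flat list of --conf key=value arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(as_conf_flags(KRYO_SETTINGS)))
```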
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up, I'll make a blog
post but it's almost a text book entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark pass it to AngularJS then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// binding name `sdc` is assumed; the archive dropped the start of this line
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo wrote:
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots are
at this point really just a POC, but at some point we may want to integrate
some data transformation features into the mahout plots classes so they're
really more future work.
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue
about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new
thread?
Trevor Grant
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov wrote:
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical Smile would have
something that ggplot2 would not, the other way around is much more expected
by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice venue
then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo wrote:
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be adding
more visualization features rather than simple scatter plots etc that may
not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue
about whether we want to go ahead and add another interpreter.
Trevor Grant
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant wrote:
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo wrote:
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of
times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots are
not working properly. The plots are swing based and can currently be
exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib
since it is pretty rich in ML support. We are looking at integrating some of
that with Zeppelin then adding code to feed the new visualizations in
Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the
way to go. Smile is swing based but can output pngs, maybe other image
formats, Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant wrote:
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the user
add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
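// Wrap Zeppelin's SparkContext `sc` as the implicit Mahout distributed context.
// The `implicit val sdc:` prefix is an assumption; the archived line starts at the type.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
  sc2sdc(sc)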
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how to
integrate with R...?
Trevor Grant
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo wrote:
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant wrote:
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi wrote:
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant wrote:
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download
now and start playing around.
I had talked about this briefly but given a properly functioning Zeppelin
interpreter for Apache Mahout, one could leverage all of the Zeppelin
visualizations, anything in AngularJS, or anything in R (through clever use
of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi wrote:
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel wrote:
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of
the project in a new body, which is less a collection of algorithms than a
roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML data
(not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would love
some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi wrote:
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and
shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hangout and not having to
worry about avoiding naughty words.
Looking forward to working with you
Suneel
Suneel Marthi
2016-05-20 22:48:02 UTC
Permalink
R u seeing a similar thing in plain Mahout-Spark shell too ?

On Fri, May 20, 2016 at 6:46 PM, Andrew Musselman wrote:
Now this, definitely would help to clarify the instructions; let me know if
I can help.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
java.lang.NoClassDefFoundError: org/apache/mahout/math/AbstractMatrix
at org.apache.mahout.sparkbindings.SparkDistributedContext.<init>(SparkDistributedContext.scala:25)
at org.apache.mahout.sparkbindings.package$.sc2sdc(package.scala:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:66)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:78)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:88)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:90)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:92)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:94)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:96)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:100)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:106)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:108)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:110)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:112)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:114)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:118)
at $iwC$$iwC$$iwC.<init>(<console>:120)
at $iwC$$iwC.<init>(<console>:122)
at $iwC.<init>(<console>:124)
at <init>(<console>:126)
at .<init>(<console>:130)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:812)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:755)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:748)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:331)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.math.AbstractMatrix
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 64 more
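The trace above bottoms out in a missing org.apache.mahout.math.AbstractMatrix, which suggests the mahout-math jar is not visible to the interpreter. One way to check that before calling sc2sdc(sc) is a small probe like the sketch below. `ClasspathCheck` and `isLoadable` are hypothetical helper names made up for illustration, not part of Mahout or Zeppelin, and the result only reflects the class loader of the thread that runs it.

```scala
object ClasspathCheck {
  /** True if `className` can be resolved by the current thread's context class loader. */
  def isLoadable(className: String): Boolean =
    try {
      // initialize = false: only resolve the class, do not run its static initializers
      Class.forName(className, false, Thread.currentThread().getContextClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
      case _: NoClassDefFoundError   => false
    }
}
```

In a %spark paragraph, `ClasspathCheck.isLoadable("org.apache.mahout.math.AbstractMatrix")` returning false would point at the artifact list rather than at the imports.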
On Fri, May 20, 2016 at 3:41 PM, Andrew Musselman wrote:
Oh might have been a browser cache issue; a couple of hard refresh methods
didn't help, but using another browser has the import link.
On Fri, May 20, 2016 at 3:36 PM, Andrew Musselman wrote:
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant wrote:
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of the
pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel wrote:
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot libs,
I’ll ask if we should consider python and matplotlib. In another project we
use python because of the RDD support on Spark though the visualizations are
extremely limited in our case. If we can pass an RDD to pyspark it would
allow custom reductions in python before plotting, even though we will
support many natively in Mahout. I’m guessing that this would cross a
context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant wrote:
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically) using
Mahout is the data is too big to fit in memory. If it's too big to fit in
memory, well then it's probably too big to plot each point (e.g. trillions
of rows, you only have so many pixels). For the example I randomly sampled a
matrix.
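As a rough illustration of the sampling step, the plain-Scala sketch below keeps a fixed-size uniform random sample of rows in one pass (reservoir sampling). It runs over an in-memory iterator rather than a Mahout DRM, and `RowSampler`, `sample`, and the default seed are hypothetical names chosen for the example, not anything from the Mahout API.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object RowSampler {
  /** Keep at most `k` rows chosen uniformly at random from `rows` (reservoir sampling). */
  def sample[T](rows: Iterator[T], k: Int, seed: Long = 42L): Vector[T] = {
    require(k >= 0, "k must be non-negative")
    val rng = new Random(seed)
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0
    rows.foreach { row =>
      seen += 1
      if (reservoir.size < k) {
        // Fill the reservoir with the first k rows.
        reservoir += row
      } else {
        // Replace a random slot with probability k / seen.
        val j = rng.nextInt(seen)
        if (j < k) reservoir(j) = row
      }
    }
    reservoir.toVector
  }
}
```

The sample size is bounded by `k` regardless of how many rows stream through, which is the property that matters when the goal is a plot with a limited number of pixels.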
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.
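A tsv-like string of this kind could be produced along the lines of the sketch below. It assumes the data has already been reduced to a small in-memory collection of numeric rows; `TsvOutput`, `toTsv`, and `toZeppelinTable` are hypothetical names for illustration. The `%table` prefix is Zeppelin's display directive for tab-separated output printed from a paragraph.

```scala
object TsvOutput {
  /** Render a header and numeric rows as one tab-separated string, one row per line. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    require(rows.forall(_.length == header.length), "every row must match the header width")
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }

  /** Prefix the TSV with Zeppelin's %table directive so the paragraph renders as a table. */
  def toZeppelinTable(header: Seq[String], rows: Seq[Seq[Double]]): String =
    "%table " + toTsv(header, rows)
}
```

Printing the result of `toZeppelinTable` from a Scala paragraph should hand the data to Zeppelin's built-in table and chart views; the same string without the prefix can be passed on to R or Python for plotting.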
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
storage space and overplotting due to number of points. The idea is that we
have to work out algorithms that condense big data information into small
plottable information (like density grids, for example, or histograms)....
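To make the density-grid idea concrete, here is a minimal plain-Scala sketch that condenses an arbitrary number of 2D points into a fixed nx-by-ny grid of counts. It works on a local collection only; a distributed version would compute such a grid per partition and sum the grids. `DensityGrid` and `bin` are hypothetical names for illustration, not Mahout functions.

```scala
object DensityGrid {
  /** Count points per cell of an nx-by-ny grid spanning [xMin, xMax] x [yMin, yMax]. */
  def bin(points: Iterable[(Double, Double)],
          xMin: Double, xMax: Double,
          yMin: Double, yMax: Double,
          nx: Int, ny: Int): Array[Array[Long]] = {
    require(nx > 0 && ny > 0 && xMax > xMin && yMax > yMin, "grid must be non-degenerate")
    val grid = Array.fill(ny, nx)(0L)
    // Map a coordinate to a cell index, clamping the upper edge into the last cell.
    def cell(v: Double, lo: Double, hi: Double, n: Int): Int =
      math.min(n - 1, ((v - lo) / (hi - lo) * n).toInt)
    // Points outside the bounding box are ignored.
    for ((x, y) <- points if x >= xMin && x <= xMax && y >= yMin && y <= yMax)
      grid(cell(y, yMin, yMax, ny))(cell(x, xMin, xMax, nx)) += 1
    grid
  }
}
```

The output size depends only on `nx` and `ny`, so the grid stays small enough to plot no matter how many points went in.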
Trevor Grant
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel wrote:
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I
just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo wrote:
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark
+ Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
for discussion, and are in no particular order:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
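The "export table to tsv string" and "read the tsv" steps can be sketched in plain Scala as below. This is only an illustration under assumptions: the names `toTsv` and `fromTsv` are invented here, the matrix is a small in-memory `Seq[Seq[Double]]` rather than a Mahout matrix, taking the first `maxRows` rows is a crude stand-in for real sampling, and the hand-off through Zeppelin's resource pool (or a file on disk) is left out entirely.

```scala
object TsvSketch {
  // Render at most maxRows rows of an in-memory matrix as a TSV string
  // with a header line. Taking the head is a placeholder for sampling.
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]], maxRows: Int): String = {
    val body = rows.take(maxRows).map(_.mkString("\t"))
    (header.mkString("\t") +: body).mkString("\n")
  }

  // Parse the TSV string back into (header, rows), as the consuming
  // paragraph (R or Python in the thread) would do on its side.
  def fromTsv(tsv: String): (Seq[String], Seq[Seq[Double]]) = {
    val lines = tsv.split("\n").toSeq
    val header = lines.head.split("\t").toSeq
    val rows = lines.tail.map(_.split("\t").toSeq.map(_.toDouble))
    (header, rows)
  }

  def main(args: Array[String]): Unit = {
    val m = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 6.0))
    println(toTsv(Seq("x", "y"), m, maxRows = 2))
  }
}
```

Capping the row count before building the string keeps the payload small, which matters because the string is what travels between paragraphs.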
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
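For reference, the Kryo-related settings quoted above can be collected as key/value pairs. The sketch below is plain Scala with no Spark dependency; the object and helper names are invented for illustration, and the first two property names are the ones listed elsewhere in this thread. Each pair could equally be handed to `SparkConf.set(key, value)` or typed into the Zeppelin Spark interpreter's property table.

```scala
object KryoSettingsSketch {
  // Property names and values as quoted in the thread.
  val kryoSettings: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  // Render as whitespace-separated "key value" lines, the layout used
  // by a spark-defaults.conf file.
  def asPropertiesLines(settings: Seq[(String, String)]): String =
    settings.map { case (k, v) => s"$k $v" }.mkString("\n")

  def main(args: Array[String]): Unit =
    println(asPropertiesLines(kryoSettings))
}
```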
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will work to add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post but it's almost a textbook entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
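On the "change these to maven not local" point: a jar path under `~/.m2/repository` already encodes its Maven coordinate, because the local repository is laid out as `<group dirs>/<artifactId>/<version>/<file>.jar`. The helper below is a hypothetical sketch (its name is not from Mahout or Zeppelin) that derives a `groupId:artifactId:version` string from such a path, on the assumption that the interpreter's dependency field accepts coordinates in that form.

```scala
object MavenCoordSketch {
  // Derive "groupId:artifactId:version" from a path inside a local
  // Maven repository. Returns None if the path does not look like one.
  def mavenCoordinate(jarPath: String): Option[String] = {
    val marker = "/repository/"
    val idx = jarPath.indexOf(marker)
    if (idx < 0) None
    else {
      val parts = jarPath.substring(idx + marker.length).split('/').toList
      // Need at least one group dir, the artifactId, the version and the file.
      if (parts.length < 4) None
      else {
        val version = parts(parts.length - 2)
        val artifact = parts(parts.length - 3)
        val group = parts.dropRight(3).mkString(".")
        Some(s"$group:$artifact:$version")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val p = "/home/trevor/.m2/repository/org/apache/mahout/mahout-math/" +
      "0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar"
    println(mavenCoordinate(p))
  }
}
```

A SNAPSHOT coordinate only resolves if a repository that actually hosts that snapshot is configured, so the local paths may still be needed for unpublished builds.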
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) as Mahout's distributed context; the name sdc is assumed
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
at this point are really just a POC, but at some point we may want to
integrate some data transformation features into the mahout plots classes
so they're really more future work.
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, Part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple
of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) as Mahout's distributed context; the name sdc is assumed
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how
to do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions there about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly but given a properly functioning Zeppelin
interpreter for Apache Mahout, one could leverage all of the Zeppelin
visualizations, anything in AngularJS, or anything in R (through clever use
of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org <http://mahout.apache.org/> - that's where we all hangout and not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 22:49:44 UTC
Permalink
Nope

Created spark context..
Spark context is available as "val sc".
Mahout distributed context is available as "implicit val sdc".
16/05/20 15:48:46 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
SQL context available as "val sqlContext".
mahout> import org.apache.mahout.math._
import org.apache.mahout.math._
mahout> import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings._
mahout> import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm._
mahout> import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.scalabindings.RLikeOps._
mahout> import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
mahout> import org.apache.mahout.sparkbindings._
import org.apache.mahout.sparkbindings._
mahout>
mahout> implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
Post by Suneel Marthi
R u seeing a similar thing in plain Mahout-Spark shell too ?
On Fri, May 20, 2016 at 6:46 PM, Andrew Musselman <
Now this, definitely would help to clarify the instructions; let me know if I can help.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
java.lang.NoClassDefFoundError: org/apache/mahout/math/AbstractMatrix
at org.apache.mahout.sparkbindings.SparkDistributedContext.<init>(SparkDistributedContext.scala:25)
at org.apache.mahout.sparkbindings.package$.sc2sdc(package.scala:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:66)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:78)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:88)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:90)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:92)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:94)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:96)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:100)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:104)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:106)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:108)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:110)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:112)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:114)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:118)
at $iwC$$iwC$$iwC.<init>(<console>:120)
at $iwC$$iwC.<init>(<console>:122)
at $iwC.<init>(<console>:124)
at <init>(<console>:126)
at .<init>(<console>:130)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:812)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:755)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:748)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:331)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.mahout.math.AbstractMatrix
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 64 more
On Fri, May 20, 2016 at 3:41 PM, Andrew Musselman <
Oh might have been a browser cache issue; even after a couple hard refresh methods using another browser has the import link.
On Fri, May 20, 2016 at 3:36 PM, Andrew Musselman <
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parentheses.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the resource pool in essentially the same way and use that plotting library if you prefer. In fact you could import the tsv into pandas and use all of the pandas plotting as well (though I think it is, for the most part, also matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and scala-spark all share the same spark context; you can create RDDs in one language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot libs, I’ll ask if we should consider python and matplotlib. In another project we use python because of the RDD support on Spark though the visualizations are extremely limited in our case. If we can pass an RDD to pyspark it would allow custom reductions in python before plotting, even though we will support many natively in Mahout. I’m guessing that this would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit out a TSV-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on that to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
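As a rough illustration of the "condense into something plottable" idea in the quote above, here is a minimal sketch of equal-width histogram binning in plain Scala. It assumes the values are already available as an in-memory sequence; `HistogramSketch` and `histogram` are hypothetical names and not part of Mahout. On distributed data the same bucketing would presumably be applied per partition and the counts summed.

```scala
object HistogramSketch {
  // Count how many values fall into each of `bins` equal-width buckets
  // spanning [min, max]. Returns (bucket lower edge, count) pairs.
  // Hypothetical helper for illustration; not a Mahout API.
  def histogram(values: Seq[Double], bins: Int): Seq[(Double, Long)] = {
    require(bins > 0, "bins must be positive")
    if (values.isEmpty) Seq.empty
    else {
      val lo = values.min
      val hi = values.max
      // Fall back to width 1.0 when all values are identical.
      val width = if (hi > lo) (hi - lo) / bins else 1.0
      val counts = new Array[Long](bins)
      values.foreach { v =>
        // Clamp so the maximum value lands in the last bucket.
        val idx = math.min(((v - lo) / width).toInt, bins - 1)
        counts(idx) += 1
      }
      counts.indices.map(i => (lo + i * width, counts(i)))
    }
  }
}
```

However many points go in, the output has only `bins` rows, which is the property that makes it safe to hand to a plotting library.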
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order of
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, e.g. a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a data science tool for non-technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
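Step 3 above (together with the "sample of a matrix" sanity check) could look roughly like the following. This is only a sketch in plain Scala: it assumes the matrix has already been collected into an in-memory sequence of rows, and `TsvSketch`, `sampleToTsv`, `maxRows`, and `seed` are hypothetical names, not Mahout or Zeppelin APIs.

```scala
import scala.util.Random

object TsvSketch {
  // Render at most `maxRows` randomly chosen rows of an in-memory matrix as a
  // tab-separated string with a header line. Sampled rows keep their original
  // relative order. Hypothetical helper for illustration only.
  def sampleToTsv(rows: Seq[Array[Double]],
                  header: Seq[String],
                  maxRows: Int,
                  seed: Long = 42L): String = {
    require(maxRows >= 0, "maxRows must be non-negative")
    val sampled: Seq[Array[Double]] =
      if (rows.length <= maxRows) rows
      else {
        // Pick maxRows distinct row indices at random, then restore row order.
        val picked = new Random(seed).shuffle(rows.indices.toList).take(maxRows).sorted
        picked.map(i => rows(i))
      }
    val lines = header.mkString("\t") +: sampled.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin Spark paragraph the resulting string might then be shared with the R or Python side through the resource pool (for example with something like `z.put("name", tsv)`), though that call is not shown in this thread and should be checked against your Zeppelin version.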
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"org.apache.spark.serializer.KryoSerializer",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector of drm, something registered with Kryo.
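For reference, the Kryo-related settings quoted in this message can be collected in one place. This is only a sketch: the keys `spark.serializer` and `spark.kryo.registrator` are the standard Spark property names that appear elsewhere in this thread, paired here with the two class names on the assumption that they belong together; `MahoutKryoSettings` and `asDefaultsConf` are hypothetical names.

```scala
object MahoutKryoSettings {
  // Property/value pairs taken from the thread. The first two keys are an
  // assumption: the quoted snippet shows only the class names.
  val settings: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  // Render as whitespace-separated "key value" lines, the layout used by
  // spark-defaults.conf.
  def asDefaultsConf: String =
    settings.map { case (k, v) => s"$k $v" }.mkString("\n")
}
```

With Spark on the classpath these pairs could presumably be applied with `new SparkConf().setAll(MahoutKryoSettings.settings)`; in Zeppelin they would instead be entered as properties of the Spark interpreter.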
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Below I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...). If
anything seems redundant or missing, please call it out.
spark.kryo.registrator    org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer          org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The archive dropped the start of this line; `implicit val sdc:` is assumed.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the mc creation code in the Spark
package helpers. All can be done in straight Spark and passed in to create
the mc when needed. Again from old weak brain cells, but I think that is
part of what makes the Mahout shell different than the Spark shell plus
imports: it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at
this point are really just a POC, but at some point we may want to
integrate some data transformation features into the mahout plots classes,
so they're really more future work.
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features, rather than simple scatter plots etc.,
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple
of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin, then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, I tend to over-import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The archive dropped the start of this line; `implicit val sdc:` is assumed.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for, or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell, so let me know if you have any
questions about what we did there.
Thanks alot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin
interpreter for Apache Mahout, one could leverage all of the Zeppelin
visualizations, anything in AngularJS, or anything in R (through clever use
of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org - that's
where we all hang out without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-20 23:56:20 UTC
Permalink
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say, if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.

Sorry for delayed reply
Tg
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Post by Trevor Grant
Hey Pat,
If you spit out a TSV, you can import it into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to
fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
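The message does not say how the matrix was sampled. As a minimal sketch, assuming the rows arrive as a plain Scala iterator rather than a Mahout DRM, reservoir sampling is one way to keep a fixed-size uniform sample in a single pass. `SampleSketch` and `reservoirSample` are hypothetical names, not Mahout or Zeppelin API.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical helper for illustration; not part of Mahout.
object SampleSketch {
  /** Keep a uniform random sample of at most k items from one pass over `rows`. */
  def reservoirSample[A](rows: Iterator[A], k: Int, rng: Random): Vector[A] = {
    val reservoir = ArrayBuffer.empty[A]
    var seen = 0
    rows.foreach { row =>
      seen += 1
      if (reservoir.size < k) {
        // Fill the reservoir with the first k rows.
        reservoir += row
      } else {
        // Replace a random slot with probability k / seen.
        val j = rng.nextInt(seen) // uniform in [0, seen)
        if (j < k) reservoir(j) = row
      }
    }
    reservoir.toVector
  }
}
```

The sample size `k` would be chosen to suit the plot (roughly, no more points than the chart can usefully show), which is the "only so many pixels" point made above.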
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plotable.
For the Zepplin-Plotting thing, we need to have a function that
will
Post by Dmitriy Lyubimov
spit
Post by Trevor Grant
Post by Pat Ferrel
out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth
doing.
Post by Trevor Grant
Post by Pat Ferrel
There are a couple of ways to go about it. I opened up the
discussion
Post by Dmitriy Lyubimov
on
mean
Post by Trevor Grant
we
Post by Pat Ferrel
can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
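As a rough illustration of the "turn something into a tsv string" step, here is a minimal sketch in plain Scala. It assumes the matrix has already been sampled down and collected into memory; `TsvSketch` and `toTsv` are hypothetical names, not part of Mahout.

```scala
// Hypothetical helper, not part of Mahout: renders a small in-memory
// matrix (rows of doubles) as a tab-separated string with a header line.
object TsvSketch {
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    // Every row must have one value per column name.
    require(rows.forall(_.length == header.length),
      "every row must match the header width")
    // Header first, then one tab-joined line per row.
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

A string like this could then be handed to another paragraph (or written to disk) for plotting. How it is passed between interpreters is left out here.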
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, [...] storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
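To make the "density grid" idea concrete, here is a small sketch in plain Scala. It is only an in-memory illustration under assumed names (`DensityGridSketch`, `densityGrid`); a distributed version would presumably count per partition and then sum the grids.

```scala
// Hypothetical sketch: condense many (x, y) points into a small
// bins x bins grid of counts, which is cheap to plot as a heat map.
object DensityGridSketch {
  def densityGrid(points: Seq[(Double, Double)], bins: Int): Array[Array[Int]] = {
    require(bins > 0 && points.nonEmpty, "need at least one bin and one point")
    val xs = points.map(_._1)
    val ys = points.map(_._2)
    val (xMin, xMax) = (xs.min, xs.max)
    val (yMin, yMax) = (ys.min, ys.max)

    // Map a value to its bin index; the maximum value goes into the last bin.
    def cell(v: Double, lo: Double, hi: Double): Int =
      if (hi == lo) 0
      else math.min(bins - 1, ((v - lo) / (hi - lo) * bins).toInt)

    // grid(row)(col): row follows y, col follows x.
    val grid = Array.fill(bins, bins)(0)
    for ((x, y) <- points) grid(cell(y, yMin, yMax))(cell(x, xMin, xMax)) += 1
    grid
  }
}
```

However many points go in, the output has only bins * bins numbers, which is the property the quoted comment is after.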
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Trevor Grant
Post by Pat Ferrel
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we'll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in Zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with SparkR support running already, you may import the following raw notes directly:
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string (with some sanity, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like, e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a data science tool for non-technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' Mahout we can include, e.g. does the user have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) Do Mahout things as you do
3) Export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have Zeppelin
4) Read the tsv from the resource pool (or disk if you didn't have Zeppelin) in R (Python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or Python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or Python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code):
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
AFAIK you will only see if Kryo is working when you have to serialize a Mahout-specific data type like a vector or DRM, something registered with Kryo.
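For reference, the Kryo-related Spark properties mentioned in this thread can be collected in one place. The sketch below is plain Scala with hypothetical names (`MahoutSparkConfSketch`, `asPropertiesFile`). It only renders the key/value pairs as text and does not touch a real Spark configuration. The 300m buffer size is simply the value quoted above, not a recommendation.

```scala
// Hypothetical sketch: the Kryo settings quoted in this thread, kept as
// plain key/value pairs so they can be pasted into an interpreter's
// property list or a spark-defaults.conf style file.
object MahoutSparkConfSketch {
  val kryoSettings: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  // Renders one "key value" line per setting.
  def asPropertiesFile(settings: Seq[(String, String)]): String =
    settings.map { case (k, v) => s"$k $v" }.mkString("\n")
}
```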
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark interpreter
- Adding dependency jars to the Spark interpreter
- Importing in a Spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Below I outline the imports/properties; please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) Use Zeppelin's resource pools to pass the object
3) Collect the object in an R paragraph, convert it to a dataframe, then map using ggplot
Once I have a working prototype I will add some syntactic sugar to prepare the matrix from the Scala side and pass it to Zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
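The "unpack" side of that plan is just parsing delimited text back into named columns. The thread intends to do this in R; the sketch below shows the same idea in plain Scala only to keep the examples in one language. `TsvReaderSketch` and `parseTsv` are hypothetical names, and the tab-separated layout with a header line is an assumption.

```scala
// Hypothetical sketch: parse tab-separated text (header line + numeric
// rows) into named columns, roughly what a dataframe reader would do.
object TsvReaderSketch {
  def parseTsv(tsv: String): Map[String, Vector[Double]] = {
    val lines = tsv.split("\n").toVector.filter(_.nonEmpty)
    require(lines.nonEmpty, "expected at least a header line")
    val header = lines.head.split("\t").toVector
    // Each remaining line becomes one row of doubles.
    val rows = lines.tail.map(_.split("\t").toVector.map(_.toDouble))
    require(rows.forall(_.length == header.length), "ragged row")
    // Pivot rows into one vector per column name.
    header.zipWithIndex.map { case (name, i) => name -> rows.map(_(i)) }.toMap
  }
}
```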
Once it's in a dataframe in R- go nuts with any plotting package you like. Likewise, it should be possible to do the same thing with matplotlib and Python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make Zeppelin recognize matrices as tables and expose all of the built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...). If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots are at this point really just a POC, but at some point we may want to integrate some data transformation features into the Mahout plots classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) I am a bit skeptical Smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all-around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native Mahout/Smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc. that may not be covered by ggplot.
That's why we were thinking about using Angular and the PNGs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to Mahout or should we leverage Zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are Swing-based and can currently be exported as PNGs. There are a few examples on the closed PR:
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin, then adding code to feed the new visualizations in Mahout. I'm here because I'm fairly familiar with AngularJS if that's the way to go. Smile is Swing-based but can output PNGs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- e.g. why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question, is this even worth building a full interpreter for or just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our Mahout Spark shell, so let me know if you have any questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
,
Pat
,
Post by Trevor Grant
Post by Pat Ferrel
Andrew
Post by Pat Ferrel
Post by Pat Ferrel
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we'd like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for Spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org <http://mahout.apache.org/> - that's where we all hang out without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-26 17:17:35 UTC
Permalink
Ahh, like the "Sample From Matrix" paragraph in the notebook.

Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor, I was reading over your blog last night again- first time since
you updated. It is great!
I have one suggestion: adding in a code line on how the sampling is done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
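For readers without a Mahout session handy, the idea behind that sampling line (draw a fixed number of rows, without replacement, before plotting) can be sketched in plain Scala. This is only an illustration on an in-memory collection, not Mahout's implementation of `drmSampleKRows`; `RowSampling`, `sampleKRows` and the seed parameter are hypothetical names.

```scala
import scala.util.Random

object RowSampling {
  /** Draw up to `k` distinct rows uniformly at random (no replacement). */
  def sampleKRows[A](rows: IndexedSeq[A], k: Int, seed: Long): IndexedSeq[A] = {
    require(k >= 0, "k must be non-negative")
    val rng = new Random(seed)
    // Shuffle the row indices, keep the first k, then restore row order.
    val picked = rng.shuffle(rows.indices.toVector).take(k).sorted
    picked.map(rows)
  }
}

object RowSamplingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical data: 10,000 (x, sin x) rows standing in for a large matrix.
    val rows = (0 until 10000).map { i => val x = i / 100.0; (x, math.sin(x)) }
    val sample = RowSampling.sampleKRows(rows, 1000, seed = 42L)
    println(s"kept ${sample.size} of ${rows.size} rows")
  }
}
```

If `k` is larger than the number of rows, this sketch simply returns every row.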
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.
Sorry for delayed reply
Tg
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the resource pool in essentially the same way and use that plotting library if you prefer. In fact you could import the tsv into pandas and use all of the pandas plotting as well (though I think it is, for the most part, also matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and scala-spark all share the same spark context, so you can create RDDs in one language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot libs, I’ll ask if we should consider python and matplotlib. In another project we use python because of the RDD support on Spark though the visualizations are extremely limited in our case. If we can pass an RDD to pyspark it would allow custom reductions in python before plotting, even though we will support many natively in Mahout. I’m guessing that this would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit out a tsv like string of the data we wanted plotted.
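A function that turns a small, already-sampled matrix into a tsv-like string could look roughly like the sketch below. It is plain Scala on in-memory rows, not a Mahout API; `MatrixTsv` and `toTsv` are hypothetical names, and the header-plus-rows layout is an assumption about what the plotting side expects.

```scala
object MatrixTsv {
  /** Render a header line plus numeric rows as one tab-separated string. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    require(rows.forall(_.length == header.length),
      "every row needs one value per header column")
    // One line per row, columns joined by tabs, lines joined by newlines.
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}

object MatrixTsvDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample: three (x, y) rows taken from a larger matrix.
    val tsv = MatrixTsv.toTsv(
      Seq("x", "y"),
      Seq(Seq(0.0, 0.0), Seq(1.5, 1.0), Seq(3.0, 0.1)))
    println(tsv)
  }
}
```

The resulting string is small enough to hand to a resource pool or write to disk, which is the whole point of sampling first.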
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on [...] to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, [...] storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
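As an illustration of what "condensing" points into a density grid means, here is a minimal sketch in plain Scala. It bins in-memory (x, y) points into a fixed grid of counts; a distributed version would do the same binning per partition and sum the grids. `DensityGrid` and `counts` are hypothetical names, not Mahout functions.

```scala
object DensityGrid {
  /** Count points per cell of an nx-by-ny grid over [xMin, xMax] x [yMin, yMax]. */
  def counts(points: Seq[(Double, Double)],
             xMin: Double, xMax: Double,
             yMin: Double, yMax: Double,
             nx: Int, ny: Int): Array[Array[Int]] = {
    require(nx > 0 && ny > 0 && xMax > xMin && yMax > yMin)
    val grid = Array.fill(ny, nx)(0)
    // Map a coordinate to a cell index; out-of-range values are clamped to the edge cells.
    def cell(v: Double, lo: Double, hi: Double, n: Int): Int =
      math.min(n - 1, math.max(0, ((v - lo) / (hi - lo) * n).toInt))
    for ((x, y) <- points)
      grid(cell(y, yMin, yMax, ny))(cell(x, xMin, xMax, nx)) += 1
    grid
  }
}

object DensityGridDemo {
  def main(args: Array[String]): Unit = {
    val pts = Seq((0.1, 0.1), (0.9, 0.9), (0.95, 0.85))
    val grid = DensityGrid.counts(pts, 0.0, 1.0, 0.0, 1.0, nx = 2, ny = 2)
    grid.foreach(row => println(row.mkString(" ")))
  }
}
```

However many points go in, the output is always nx * ny numbers, so it stays plottable.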
Trevor Grant
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
  - If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
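The receiving end of step 4 is just "split the tsv string back into columns". In the thread that happens in an R (or Python) paragraph; the sketch below shows the same unpacking in plain Scala only to keep to one language. `TsvReader` and `parse` are hypothetical names, and the header-plus-numeric-rows layout is an assumption.

```scala
object TsvReader {
  /** Split a TSV string into its header and rows of doubles. */
  def parse(tsv: String): (Vector[String], Vector[Vector[Double]]) = {
    val lines = tsv.split("\n").toVector.filter(_.nonEmpty)
    require(lines.nonEmpty, "expected at least a header line")
    val header = lines.head.split("\t").toVector
    // Every remaining line is one row; every field must parse as a number.
    val rows = lines.tail.map(_.split("\t").toVector.map(_.toDouble))
    (header, rows)
  }
}

object TsvReaderDemo {
  def main(args: Array[String]): Unit = {
    val (header, rows) = TsvReader.parse("x\ty\n0.0\t0.0\n1.5\t1.0")
    println(header.mkString(", "))
    rows.foreach(r => println(r.mkString(", ")))
  }
}
```

Once the data is back in columns, any plotting package can take over.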
Trevor Grant
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector of drm, something registered with Kryo.
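For anyone launching outside Zeppelin, the Kryo settings quoted in this thread can be kept in one place and rendered as `--conf` arguments for `spark-submit`. This is a minimal sketch in plain Scala with no Spark dependency; the property names and values are the ones given in the thread, while `MahoutKryoSettings` and `asSubmitArgs` are hypothetical names.

```scala
object MahoutKryoSettings {
  // Property names and values as quoted in the thread; the 300m buffer is Pat's choice.
  val settings: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Render the settings as `--conf key=value` arguments for spark-submit. */
  def asSubmitArgs: Seq[String] =
    settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}

object MahoutKryoSettingsDemo {
  def main(args: Array[String]): Unit =
    println(MahoutKryoSettings.asSubmitArgs.mkString(" "))
}
```

In Zeppelin the same four pairs would instead be entered as properties of the Spark interpreter, as described elsewhere in the thread.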
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a new thread?
Trevor Grant
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage zeppelins visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy[??] , I've just
reworked
Post by Trevor Grant
Post by Pat Ferrel
it a
Post by Pat Ferrel
Post by Pat Ferrel
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are Swing based and can currently be exported as PNGs. There are a few examples on the closed pull request:
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin, then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS, if that’s the way to go. Smile is Swing based but can output PNGs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
My Java is sketchy at best, and I tend to over-import. I pulled in the following:
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so because this isn't public): this integration is super easy from a user perspective, almost too easy, e.g. why not let the user add it themselves... Add the appropriate Maven artifacts, restart the interpreter, then:
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question: do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question: is this even worth building a full interpreter for, or should we just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our Mahout Spark shell, so let me know if you have any questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the Slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting you last week in Vancouver. Per our conversation, I wanted to introduce you to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for Spark) and would appreciate your help (as also with all things DL and ML).
We definitely can use all your help as we are revamping the Mahout project and shedding its legacy MapReduce image.
I sent you an invite to the Mahout Slack channel, mahout.apache.org <http://mahout.apache.org/> - that's where we all hang out without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Suneel Marthi
2016-05-26 18:22:47 UTC
Permalink
While on this subject, do we have a plan yet for integrating Zeppelin into
Mahout or, conversely, for having a Mahout-specific interpreter for
Zeppelin? I think that should be high priority in the short term.
Post by Trevor Grant
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.
Trevor Grant
Trevor, I was reading over your blog again last night, for the first time since
you updated it. It is great!
I have one suggestion: adding a line of code showing how the sampling is done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid that 0.6-SNAPSHOT as a version is uninformative to me. I'd say, if possible, your first troubleshooting measure would be to re-clone or do a "git fetch upstream" to get up to the very latest.
Sorry for delayed reply
Tg
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Post by Trevor Grant
Hey Pat,
If you spit out a TSV, you can import it into pyspark / matplotlib from the resource pool in essentially the same way and use that plotting library if you prefer. In fact you could import the TSV into pandas and use all of the pandas plotting as well (though I think it is, for the most part, also matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and scala-spark all share the same Spark context, so you can create RDDs in one language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as an RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but, being the most ignorant of plot libs, I’ll ask if we should consider Python and matplotlib. In another project we use Python because of the RDD support on Spark, though the visualizations are extremely limited in our case. If we can pass an RDD to pyspark it would allow custom reductions in Python before plotting, even though we will support many natively in Mahout. I’m guessing that this would cross a context boundary and require a write to disk?
1) what does the inter-language support look like with Spark Python vs SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post, which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is that the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, and you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-plotting thing, we need to have a function that will spit out a TSV-like string of the data we want plotted.
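A rough illustration of the "sample, then spit out a TSV-like string" idea, as a minimal sketch in plain Scala. It deliberately uses no Mahout or Zeppelin API: the object name `TsvSketch`, both function names, and the use of `Seq[Array[Double]]` as a stand-in for an in-core matrix are assumptions made for illustration only.

```scala
import scala.util.Random

// Hypothetical helper, not part of Mahout: rows of doubles stand in for a matrix.
object TsvSketch {
  /** Uniformly sample at most `k` rows without replacement (seeded for repeatability). */
  def sampleRows(rows: Seq[Array[Double]], k: Int, seed: Long): Seq[Array[Double]] =
    if (rows.length <= k) rows
    else new Random(seed).shuffle(rows.toList).take(k)

  /** Render a header plus rows as a tab-separated string, one row per line. */
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String =
    (header.mkString("\t") +: rows.map(_.mkString("\t"))).mkString("\n")
}
```

Sampling first keeps the string small enough to hand to a plotting paragraph, whatever mechanism is used to pass it along.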
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on [...] that [...] to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a TSV string.
I have some general ideas on possible approaches to making an honest Mahout interpreter, but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, [...] storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
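To make the "condense into small plottable information" point concrete, here is a minimal histogram sketch in plain Scala. It is not a Mahout function; `HistogramSketch` and its parameters are hypothetical, and an `Iterator[Double]` stands in for one column of a large dataset. In a distributed setting one would presumably compute such counts per partition and add them up, which is an assumption about the design rather than anything stated in this thread.

```scala
// Hypothetical helper, not part of Mahout.
object HistogramSketch {
  /** Count values into `bins` equal-width buckets spanning [min, max]; values outside are skipped. */
  def histogram(values: Iterator[Double], min: Double, max: Double, bins: Int): Array[Long] = {
    require(bins > 0 && max > min, "need bins > 0 and max > min")
    val counts = new Array[Long](bins)
    val width = (max - min) / bins
    values.foreach { v =>
      if (v >= min && v <= max) {
        // A value equal to `max` falls into the last bucket.
        val i = math.min(((v - min) / width).toInt, bins - 1)
        counts(i) += 1
      }
    }
    counts
  }
}
```

However many points go in, the output has only `bins` numbers, so it always fits in a plot.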
Trevor Grant
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in Zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below; however, without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, in no particular order:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a TSV string (with some sanity, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like, e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a data science tool for non-technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' Mahout we can include, e.g. does the user have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) Do Mahout things as you do
3) Export table to TSV string, which is passed to a resource pool
- This could be done to a disk if you didn't have Zeppelin
4) Read the TSV from the resource pool (or disk if you didn't have Zeppelin) in R (Python soon) and create a <plot package of your choice>
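Step 4 above is described for R. Purely to illustrate what the consuming side of the pipeline has to do, here is the same unpacking step sketched in plain Scala. `TsvReadSketch` and `parseTsv` are hypothetical names, and the sketch assumes a header line followed by all-numeric rows; it does not touch Zeppelin's resource pool at all.

```scala
// Hypothetical helper: the receiving end of a TSV hand-off.
object TsvReadSketch {
  /** Split a TSV string into a header row and numeric data rows. */
  def parseTsv(tsv: String): (Seq[String], Seq[Array[Double]]) = {
    val lines = tsv.split("\n").toSeq.filter(_.nonEmpty)
    require(lines.nonEmpty, "expected at least a header line")
    val header = lines.head.split("\t").toSeq
    // Every remaining line is assumed to hold only numeric fields.
    val rows = lines.tail.map(_.split("\t").map(_.toDouble))
    (header, rows)
  }
}
```

Whatever language reads the string, the contract is the same: first line names the columns, every other line is one row.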
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or Python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or Python. Knowing Mahout is already bad enough, but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What is installed and how this all is set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see whether Kryo is working when you have to serialize a Mahout-specific data type like a vector or DRM, something registered with Kryo.
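For anyone wanting to carry these settings around in code, a minimal sketch in plain Scala follows. The values are the ones quoted in this message; the first two property names (`spark.serializer`, `spark.kryo.registrator`) are the standard Spark keys and are an assumption here, since the archived message does not show them. `KryoSettingsSketch` and `asConfArgs` are hypothetical names, and no Spark dependency is needed to run it.

```scala
// Hypothetical holder for the Kryo-related Spark properties discussed above.
object KryoSettingsSketch {
  val mahoutKryoProps: Seq[(String, String)] = Seq(
    // Key names on the next two lines are assumed (standard Spark property names).
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Render properties as `--conf key=value` arguments for a Spark launcher. */
  def asConfArgs(props: Seq[(String, String)]): Seq[String] =
    props.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

In Zeppelin the same key/value pairs would instead be entered as properties of the Spark interpreter.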
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Below I will outline the imports/properties; please look them over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the Scala side and pass it to Zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and Python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post, but it's almost a textbook entry tutorial on using imports in Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make Zeppelin recognize matrices as tables and expose all of the
built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point, however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...). If
anything seems redundant or missing, please call it out.
spark.kryo.registrator
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer
org.apache.spark.serializer.KryoSerializer
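As a small sketch, the two property/value pairs above can be kept as plain data and rendered as "key value" lines in the style of `spark-defaults.conf`. `MahoutKryoConf` and `asDefaultsConf` are hypothetical names, and nothing here touches Spark itself; whether these two settings are sufficient is exactly the question being asked in this message:

```scala
// Hypothetical holder for the two settings quoted above; no Spark dependency.
object MahoutKryoConf {
  val properties: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator"
  )

  /** One "key value" line per property, as in a spark-defaults.conf file. */
  def asDefaultsConf: String =
    properties.map { case (k, v) => s"$k $v" }.mkString("\n")
}
```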
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The start of the next line was lost in the archive; "implicit val sdc:" is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots at
this point are really just a POC, but at some point we may want to
integrate some data transformation features into the Mahout plots classes,
so they're really more future work.
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new
thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple
of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats, Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The start of the next line was lost in the archive; "implicit val sdc:" is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell, so let me know if you have any
questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin
interpreter for Apache Mahout, one could leverage all of the Zeppelin
visualizations, anything in AngularJS, or anything in R (through clever use
of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of
the project in a new body, which is less a collection of algorithms than a
roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hangout and not having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-26 19:17:22 UTC
Permalink
Short answer: it is high priority. I think it will be a Mahout interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back spark interpreter (e.g.
exposed through something like %spark.mahout). So I have thoughts, but no
plan. Been busy with a couple of other commitments.

On the Mahout side we need:
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things like
histograms, etc, that one would want to plot. I.e. histogram bucketing?
(less important for the moment)
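To make the "histogram bucketing" idea concrete, here is a minimal in-memory sketch with equal-width buckets. `HistogramSketch` and `bucketCounts` are hypothetical names, and this is not a Mahout function; for a super-large matrix the counting would have to run as a distributed reduction, which this sketch does not attempt:

```scala
// Hypothetical helper, not part of Mahout.
object HistogramSketch {
  /** Count values into `numBuckets` equal-width buckets spanning [min, max]. */
  def bucketCounts(values: Seq[Double], numBuckets: Int): Array[Int] = {
    require(numBuckets > 0, "numBuckets must be positive")
    val counts = new Array[Int](numBuckets)
    if (values.nonEmpty) {
      val lo = values.min
      val hi = values.max
      val width = (hi - lo) / numBuckets
      values.foreach { v =>
        // If all values are equal the width is 0, so everything goes to bucket 0.
        // The maximum value is clamped into the last bucket.
        val idx =
          if (width == 0.0) 0
          else math.min(((v - lo) / width).toInt, numBuckets - 1)
        counts(idx) += 1
      }
    }
    counts
  }
}
```

Only the bucket counts need to leave the cluster, so the result stays small enough to pass to a plotting paragraph regardless of how many rows went in.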

On the Zeppelin side we need:
an interpreter.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Suneel Marthi
While on this subject, do we have a plan yet for integrating Zeppelin into
Mahout, or (the converse) for having a Mahout-specific interpreter for
Zeppelin? I think that should be high priority in the short term.
Post by Trevor Grant
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Trevor, I was reading over your blog last night again- first time since
you updated. It is great!
I have one suggestion: adding in a code line showing the sampling:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
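For readers unfamiliar with the idea, sampling k rows without replacement can be sketched on an ordinary in-memory collection as below. This is only an illustration of the concept, not how `drmSampleKRows` is implemented; `RowSampling` and `sampleKRows` are hypothetical names, and the explicit seed is an added assumption to make runs repeatable:

```scala
import scala.util.Random

// Hypothetical in-memory illustration; not Mahout's implementation.
object RowSampling {
  /** Pick up to k distinct rows uniformly at random (no replacement). */
  def sampleKRows[A](rows: Seq[A], k: Int, seed: Long): Vector[A] = {
    require(k >= 0, "k must be non-negative")
    // Shuffle a copy of the rows, then keep the first k of them.
    new Random(seed).shuffle(rows.toVector).take(k)
  }
}
```

If fewer than k rows exist, all of them are returned, so callers should not assume the result always has exactly k rows.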
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Post by Trevor Grant
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context; you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as an RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
Post by Trevor Grant
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Post by Pat Ferrel
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark
python
Post by Andrew Musselman
Post by Dmitriy Lyubimov
vs
Post by Trevor Grant
Post by Pat Ferrel
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post, which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is that the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
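To make the sampling idea concrete, here is a minimal sketch of uniform row sampling in a single pass (reservoir sampling, Algorithm R). It is plain Python over an iterator, not the Mahout call used in the notebook, which the thread does not show; `sample_rows` and the toy data are hypothetical.

```python
import random


def sample_rows(rows, k, seed=None):
    """Keep k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Toy stand-in for a matrix too large to plot point by point.
rows = ([i, i * i] for i in range(100_000))
sample = sample_rows(rows, k=500, seed=42)
print(len(sample))
```

Only the k sampled rows ever need to reach the plotting side, however many rows the source holds.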
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-plotting thing, we need to have a function that will spit out a tsv-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on that to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
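The "turn something into a tsv string" step could look roughly like the sketch below. It works on an in-memory list of rows (for example an already sampled block), not on a Mahout matrix; the function name `matrix_to_tsv` and its optional header argument are assumptions made for illustration.

```python
def matrix_to_tsv(rows, header=None):
    """Render rows of numbers as tab-separated text, one matrix row per line."""
    lines = []
    if header is not None:
        lines.append("\t".join(header))
    for row in rows:
        # repr(float(v)) keeps full precision and always includes a decimal point.
        lines.append("\t".join(repr(float(value)) for value in row))
    return "\n".join(lines) + "\n"


tsv = matrix_to_tsv([[1, 2.5], [3, 4]], header=["x", "y"])
print(tsv)
```

Keeping the output as one plain string means it can be stored wherever the consuming paragraph can reach it.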
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, plotting anything distributed potentially presents 2 problems: storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
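A density grid of the kind mentioned above can be sketched in a few lines: every point is reduced to a cell index, so the output has at most bins x bins entries no matter how many points go in. This is a single-machine illustration with hypothetical names (`density_grid`, the toy points); in a distributed setting the per-partition counts would presumably be summed.

```python
from collections import Counter


def density_grid(points, x_range, y_range, bins):
    """Count points per cell of a bins x bins grid; returns {(ix, iy): count}."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    x_step = (x_max - x_min) / bins
    y_step = (y_max - y_min) / bins
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted window
        # Points on the upper edge are folded into the last cell.
        ix = min(int((x - x_min) / x_step), bins - 1)
        iy = min(int((y - y_min) / y_step), bins - 1)
        counts[(ix, iy)] += 1
    return counts


points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (1.0, 1.0), (0.5, 0.95)]
grid = density_grid(points, (0.0, 1.0), (0.0, 1.0), bins=2)
print(dict(grid))
```

The cell counts can then be exported and drawn as a heat map or tile plot instead of a scatter plot.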
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order of
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
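For the "disk if you didn't have zeppelin" variant of steps 3 and 4, a minimal sketch of the hand-off is a TSV file written by the producer and read back by the consumer. Both sides are shown in Python here purely for illustration (in the thread the producer is Scala and the consumer is R); the file name and helper names are hypothetical.

```python
import csv
import os
import tempfile


def write_tsv(path, header, rows):
    """Producer side: dump a header plus numeric rows as tab-separated text."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv(path):
    """Consumer side: return (header, rows) with every cell parsed as a float."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [[float(cell) for cell in row] for row in reader]
    return header, rows


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "matrix_sample.tsv")  # hypothetical file name
    write_tsv(path, ["x", "y"], [[1.0, 2.0], [3.0, 4.5]])
    header, rows = read_tsv(path)

print(header, rows)
```

Inside Zeppelin the file would be replaced by the resource pool, but the shape of the exchanged data stays the same.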
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"org.apache.spark.serializer.KryoSerializer",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
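For reference, the four settings can be written out as complete key/value pairs. The keys for the first two values are not in the snippet above as archived; they are taken from the property listing in the May 16 message quoted further down the thread (spark.serializer and spark.kryo.registrator). The sketch below only renders them in the whitespace-separated layout of a spark-defaults.conf file; the helper name is hypothetical and the 300m buffer is simply the value quoted above, not a recommendation.

```python
# Property names and values as quoted in the thread.
kryo_properties = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
}


def to_spark_defaults(properties):
    """Render properties as 'key value' lines, one per property."""
    return "\n".join(f"{key} {value}" for key, value in properties.items())


conf_text = to_spark_defaults(kryo_properties)
print(conf_text)
```

In Zeppelin the same pairs would be entered as properties of the Spark interpreter instead of a conf file.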
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap Zeppelin's SparkContext (sc) as a Mahout distributed context.
// The left-hand side was lost in the archive; `implicit val sdc` is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
Most of the hard work was done by Dmitriy; I've just reworked it a
couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are Swing based and can currently be
exported as PNGs. There are a few examples on the closed PR:
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is Swing based but can output PNGs, maybe other image
formats (Andy?).
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
My Java is sketchy at best, I tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap the existing SparkContext `sc` in a Mahout distributed context.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how
to do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our Mahout Spark shell, so let me know if you have any
questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYI...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms
than a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for Spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org - that's
where we all hang out and don't have to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Eric Charles
2016-05-30 02:57:43 UTC
Permalink
Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?

https://github.com/apache/incubator-zeppelin/pull/928

It declares in the spark interpreter the mahout deps, and creates the
sdc (spark distributed context).
OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.
1- The blog post referenced the linear-regression example notebook twice-
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, a huge apology, because that
notebook is where all of the 'magic happened' (all of the screenshots /
ggplots / etc happened there).
https://github.com/rawkintrevo/incubator-zeppelin
If you build, and set 'spark.mahout' to 'true' in the Spark Interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it. I'll be opening a PR soon; we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.
In essence, when 'spark.mahout' is 'true' you jump right in on the R-like
DSL and you have an sdc declared based on the underlying sc.
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity. I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin
project if that's acceptable to the Zeppelin folks, even though most of it
might be repeated.
What do others have to say?
have a good holiday weekend,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Thx Trevor,
Re: m-1854, it was something that we started when we were first discussing
using the Smile plots and trying to pipe them over to Zeppelin. As far as
I know no progress was made on it. I've unassigned it.
Feel free to assign any Jiras to yourself. I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.
________________________________________
Sent: Saturday, May 28, 2016 11:21:44 PM
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
can I help?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great!
When you free up and have the time, could you create some Jiras for these?
We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras). I can close m-1854 and
m-1855 out and we can start new ones if they're not relevant anymore or we
can just go with those.
Thanks
________________________________________
Sent: Thursday, May 26, 2016 3:17:22 PM
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back Spark interpreter (e.g.
exposed through something like %spark.mahout). So I have thoughts, but no
plan. Been busy with a couple of other commitments.
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things like
histograms, etc, that one would want to plot. I.e. histogram bucketing?
(less important for the moment)
an interpreter.
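As a rough illustration of the first item, here is a minimal sketch of a matrix-to-TSV helper in plain Scala. It is not Mahout code: `TsvSketch` and `matrixToTsv` are hypothetical names, and the matrix is assumed to be a small in-memory `Seq[Seq[Double]]` rather than a Mahout `Matrix` or DRM.

```scala
object TsvSketch {
  /** Render a small in-memory matrix as TSV: one header row, then one line per matrix row. */
  def matrixToTsv(rows: Seq[Seq[Double]], header: Seq[String]): String = {
    require(rows.forall(_.length == header.length),
      "every row must have one value per column name")
    // Tabs separate columns, newlines separate rows.
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin Scala paragraph, printing the result with a `%table ` prefix, e.g. `println("%table " + TsvSketch.matrixToTsv(rows, header))`, should hand it to the built-in table display, which is presumably the route the TSV idea is aiming at.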
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
While on this subject, do we have a plan yet of integrating Zeppelin into
Mahout, or (the converse) of having a Mahout-specific interpreter for
Zeppelin? I think that should be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
Trevor, I was reading over your blog last night again, first time since
you updated. It is great!
I have one suggestion, being adding in a code line on how the sampling
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
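To make the idea behind that sampling line concrete, here is a small local sketch of drawing k rows without replacement. It is only a plain-Scala stand-in, not how Mahout's `drmSampleKRows` is implemented: `SampleSketch`, `sampleKRows` and the explicit `seed` parameter are hypothetical, and the rows are assumed to already fit in memory.

```scala
import scala.util.Random

object SampleSketch {
  /** Pick k distinct rows uniformly at random, keeping their original order.
    * Returns every row when k is at least the number of rows. */
  def sampleKRows[A](rows: IndexedSeq[A], k: Int, seed: Long): IndexedSeq[A] = {
    val rng = new Random(seed)
    // Shuffle the row indices, keep the first k, then restore the original order.
    rng.shuffle(rows.indices.toVector).take(k).sorted.map(rows)
  }
}
```

Fixing the seed makes a plot reproducible between notebook runs; the sample is small enough to pass to a plotting library directly.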
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same Spark context, so you can create RDDs in
one language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as an RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to
fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-plotting thing, we need to have a function that will spit
out a TSV-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
take that to mean we can do it in a way that makes the most sense to Mahout
users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an
honest-mahout interpreter but I want to play in the code and look at the
Flink-Mahout shell a bit before I try to organize my thoughts and present
them.
...(2) not sure what is the point of supporting distributed anything. It
is distributed presumably because it is hard to keep it in memory.
Therefore, plotting anything distributed potentially presents 2
storage space and overplotting due to number of points. The idea is that
we have to work out algorithms that condense big data information into
small plottable information (like density grids, for example, or
histograms)....
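The kind of condensing described in that quote can be sketched with equal-width histogram bucketing. This is a minimal single-machine sketch in plain Scala, assuming the values fit in a local `Seq[Double]`; `HistogramSketch` and `histogram` are hypothetical names and nothing here is an existing Mahout function.

```scala
object HistogramSketch {
  /** Count values into `bins` equal-width buckets spanning [min, max] of the data. */
  def histogram(values: Seq[Double], bins: Int): Array[Int] = {
    require(bins > 0, "need at least one bin")
    val counts = new Array[Int](bins)
    if (values.nonEmpty) {
      val lo = values.min
      val hi = values.max
      val width = (hi - lo) / bins
      for (v <- values) {
        // The maximum value would land one past the end, so clamp it into the last bucket.
        // When all values are equal the width is zero and everything goes into bucket 0.
        val idx = if (width == 0.0) 0 else math.min(((v - lo) / width).toInt, bins - 1)
        counts(idx) += 1
      }
    }
    counts
  }
}
```

However many input points there are, the output is only `bins` integers, which is what makes it plottable. In a distributed setting one would presumably agree on the bucket edges first, count per partition, and sum the per-partition arrays, but that part is not shown here.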
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in Zeppelin but I
just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."  -Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
  - If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
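For anyone following along, step 3 (matrix to tsv string, "with some sanity") could look roughly like the sketch below. It is not the code from the notebooks: `MatrixTsv`, `toTsv` and `sampleRows` are made-up names, and plain Scala collections stand in for a Mahout matrix so the snippet runs without any Mahout or Zeppelin jars. Handing the resulting string to Zeppelin's resource pool is left out.

```scala
// Hypothetical helper, not part of Mahout or Zeppelin.
object MatrixTsv {

  /** Render a header plus numeric rows as one TSV string (rows separated by "\n"). */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }

  /** Keep at most maxRows rows by taking every k-th row, so a huge
    * matrix does not end up as a huge string in the notebook.
    */
  def sampleRows[A](rows: Seq[A], maxRows: Int): Seq[A] = {
    require(maxRows > 0, "maxRows must be positive")
    if (rows.size <= maxRows) rows
    else {
      // Step is rounded up so the result never exceeds maxRows.
      val step = math.ceil(rows.size.toDouble / maxRows).toInt
      rows.indices.by(step).map(rows(_))
    }
  }
}
```

Something like `MatrixTsv.toTsv(Seq("x", "y"), MatrixTsv.sampleRows(rows, 1000))` would then be the string that gets passed along; the systematic every-k-th-row sample is only one possible choice, a random sample may be more appropriate for ordered data.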
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."  -Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
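As a rough illustration of the settings above, here they are as plain key/value pairs, with a helper that renders them as `--conf key=value` arguments of the kind spark-submit accepts. `MahoutKryoConf` and `toConfArgs` are hypothetical names; the keys for the serializer and registrator are taken from the interpreter property list posted elsewhere in this thread, and in Zeppelin the same pairs would go into the Spark interpreter properties rather than a command line.

```scala
// Hypothetical holder for the Kryo-related Spark properties from this thread.
object MahoutKryoConf {

  val settings: Seq[(String, String)] = Seq(
    "spark.serializer"             -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator"       -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer"  -> "300m"
  )

  /** Turn key/value pairs into a flat list of "--conf", "key=value" arguments. */
  def toConfArgs(kv: Seq[(String, String)]): Seq[String] =
    kv.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

`MahoutKryoConf.toConfArgs(MahoutKryoConf.settings).mkString(" ")` gives a string that can be pasted after `spark-submit`; the 300m buffer is just the value quoted above, not a recommendation.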
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
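To make the "easily unpack" part concrete: if the serializable object is a TSV string with a header line, the receiving side only has to split on newlines and tabs. In the plan above that happens in an R paragraph; the sketch below does the same unpacking in Scala purely as a round-trip sanity check, and `TsvTable`, `Table` and `parse` are invented names, not an existing API.

```scala
// Hypothetical round-trip check for a header + numeric-rows TSV string.
object TsvTable {

  final case class Table(header: Vector[String], rows: Vector[Vector[Double]])

  /** Parse a TSV string whose first line is the header and whose
    * remaining lines are rows of numbers.
    */
  def parse(tsv: String): Table = {
    val lines = tsv.split("\n").toVector.map(_.stripSuffix("\r")).filter(_.nonEmpty)
    require(lines.nonEmpty, "expected at least a header line")
    val header = lines.head.split("\t").toVector
    val rows = lines.tail.map { line =>
      val cells = line.split("\t").toVector
      require(cells.size == header.size, s"expected ${header.size} cells but got ${cells.size}")
      cells.map(_.toDouble)
    }
    Table(header, rows)
  }
}
```

If `parse` accepts what the Scala side produces, an R `read.table` style call with a tab separator and a header should be able to read it too, which is the property that matters for step 3.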
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a text book entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator    org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer    org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to add/change one jar per below, however this does
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Add following code to first paragraph of
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."  -Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."  -Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."  -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage Zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry, to be a little more clear: part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
Most of the hard work was done by Dmitriy; I've just reworked it a couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are Swing based and can currently be exported as PNGs. There are a few examples on the closed pull request:
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala:
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin, then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is Swing based but can output PNGs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
Thanks, you basically did all of the hard work for me!
I've got the linear regression example working
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
My Java is sketchy at best, and I tend to over-import. I pulled in the following jars:
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so because this isn't public): this integration is super easy from a user perspective, almost too easy, e.g. why not let the user add it themselves... Add the appropriate Maven artifacts, restart the interpreter, and then:
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap the existing SparkContext `sc` in a Mahout distributed context.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question: do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question: is this even worth building a full interpreter for, or should we just make a really nice blog post with examples on how to integrate with R...?
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our Mahout Spark shell, so let me know if you have any questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the Slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYI...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know, Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math, but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for Spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout Slack channel, mahout.apache.org <http://mahout.apache.org/> - that's where we all hang out without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Suneel Marthi
2016-05-30 03:47:07 UTC
Permalink
Hi Eric,

We r talking about the same PR which is a tweak of existing Spark-Zeppelin
interpreter.
What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of above?

BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
Vancouver this time?
Post by Eric Charles
Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
https://github.com/apache/incubator-zeppelin/pull/928
It declares in the spark interpreter the mahout deps, and creates the sdc
(spark distributed context).
OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.
1- The blog post referenced the linear-regression example notebook twice-
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, a huge apology, because that
notebook is where all of the 'magic happened' (all of the screen shots /
gg-plots / etc. happened there).
https://github.com/rawkintrevo/incubator-zeppelin
If you build, and set 'spark.mahout' to 'true' in the Spark Interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it. I'll be opening a PR soon; we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.
In essence, when 'spark.mahout' is 'true' you jump right in on the R-like DSL
and you have an sdc declared based on the underlying sc.
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity. I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin project
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.
What do others have to say?
have a good holiday weekend,
tg
Thx Trevor,
Re: m-1854, it was something that we started when we were first discussing
using the Smile plots and trying to pipe them over to Zeppelin.
As far as I know, no progress was made on it. I've unassigned it.
Feel free to assign any Jiras to yourself. I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.
________________________________________
Sent: Saturday, May 28, 2016 11:21:44 PM
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How can I help?
tg
Great!
When you free up and have the time, could you create some Jiras for these?
We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras). I can close m-1854 and
m-1855 out and we can start new ones if they're not relevant anymore, or we
can just go with those.
Thanks
________________________________________
Sent: Thursday, May 26, 2016 3:17:22 PM
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout interpreter
in Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back Spark interpreter (e.g.
exposed through something like %spark.mahout). So I have thoughts, but no
plan. Been busy with a couple of other commitments.
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things like histograms, etc., that one would want to plot, i.e. histogram bucketing? (less important for the moment)
an interpreter.
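To make the first two wish-list items concrete, here is a minimal sketch in plain Scala. It deliberately works on ordinary in-memory collections rather than Mahout matrices, and the names `PlotPrep`, `toTsv`, and `histogram` are hypothetical, not part of Mahout or Zeppelin; the real helpers would take Mahout in-core matrices and DRMs instead.

```scala
// Hypothetical helpers: illustrative only, not Mahout or Zeppelin API.
object PlotPrep {
  /** Render a small in-memory matrix as tab-separated text, one row per line. */
  def toTsv(rows: Seq[Seq[Double]], header: Seq[String] = Nil): String = {
    val body = rows.map(_.mkString("\t"))
    val lines = if (header.isEmpty) body else header.mkString("\t") +: body
    lines.mkString("\n")
  }

  /** Count values into `bins` equal-width buckets spanning [min, max].
    * Returns (bucket lower bound, count) pairs in ascending order. */
  def histogram(values: Seq[Double], bins: Int): Seq[(Double, Int)] = {
    require(bins > 0 && values.nonEmpty, "need at least one bin and one value")
    val lo = values.min
    val hi = values.max
    // Fall back to a width of 1.0 when all values are identical.
    val width = if (hi > lo) (hi - lo) / bins else 1.0
    val counts = Array.fill(bins)(0)
    for (v <- values) {
      // The maximum value lands in the last bucket rather than overflowing.
      val idx = math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    counts.indices.map(i => (lo + i * width, counts(i)))
  }
}
```

A TSV string produced this way is the kind of thing one could hand to another Zeppelin paragraph for plotting, and the bucketed counts are small enough to plot even when the underlying data is not.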
While on this subject, do we have a plan yet for integrating Zeppelin into
Mahout, or (the converse) for having a Mahout-specific interpreter for
Zeppelin? I think that should be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it Saturday.
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
Trevor, I was reading over your blog last night again, first time since
you updated. It is great!
I have one suggestion: adding a code line showing how the sampling is done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
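For readers unfamiliar with the idea behind sampling a fixed number of rows without replacement, a small in-memory analogue can be sketched as follows. This is not Mahout's `drmSampleKRows` implementation (which works on a distributed matrix); `RowSampler`, `sampleKRows`, and the `seed` parameter are hypothetical names used only for illustration.

```scala
import scala.util.Random

// Hypothetical in-memory analogue of sampling k rows without replacement.
object RowSampler {
  /** Pick `k` distinct rows uniformly at random, keeping their original order.
    * If `k` is at least the number of rows, all rows are returned. */
  def sampleKRows[A](rows: IndexedSeq[A], k: Int, seed: Long = 42L): IndexedSeq[A] = {
    require(k >= 0, "k must be non-negative")
    if (k >= rows.length) rows
    else {
      // Shuffle the row indices, keep the first k, then restore row order.
      val picked = new Random(seed).shuffle(rows.indices.toVector).take(k).sorted
      picked.map(rows)
    }
  }
}
```

The point for plotting is the same as in the blog post: reduce a matrix that is too large to draw point by point down to a sample small enough to collect and hand to a plotting library.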
Andy
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid that 0.6-SNAPSHOT as a version is
uninformative to me. I'd say, if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes, you're right, the artifacts weren't added to the dependencies; is
that a feature in more modern Zep?
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
No parentheses:
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Post by Trevor Grant
Hey Pat,
If you spit out a TSV, you can import it into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the TSV into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same Spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as an RDD? E.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of
things."
-Virgil*
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress, but being the most ignorant of plot
libs, I’ll ask if we should consider Python and matplotlib. In another
project we use Python because of the RDD support on Spark, though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in Python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) What does the inter-language support look like with Spark Python vs
SparkR? Can we transfer RDDs?
2) Are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to
the
post
Post by Suneel Marthi
which
Post by Trevor Grant
Post by Andrew Musselman
I'll
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
rebroadcast below. In essence the whole reason you are
(theoretically)
using Mahout is the data is to big to fit in memory.
Post by Trevor Grant
If
it's
Post by Suneel Marthi
to
Post by Trevor Grant
big
Post by Andrew Musselman
Post by Dmitriy Lyubimov
to
Post by Trevor Grant
fit
Post by Trevor Grant
Post by Pat Ferrel
in memory, well then its probably too big to plot each
point
(e.g.
Post by Trevor Grant
Post by Andrew Musselman
trillions of row, you only have so many pixels). For
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
the
example
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
I
Post by Dmitriy Lyubimov
randomly sampled a matrix.
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
So as Dmitriy says, in Mahout we need to have functions
that
will
Post by Trevor Grant
'preprocess' the data into something plotable.
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
For the Zepplin-Plotting thing, we need to have a
function
that
Post by Suneel Marthi
Post by Trevor Grant
will
Post by Andrew Musselman
Post by Dmitriy Lyubimov
spit
Post by Trevor Grant
Post by Trevor Grant
out a tsv like string of the data we wanted plotted.
Post by Pat Ferrel
I agree an honest Mahout interpreter in Zeppelin is
probably
worth
Post by Trevor Grant
Post by Andrew Musselman
doing.
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
There are a couple of ways to go about it. I opened up
the
discussion
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
on
Post by Trevor Grant
Post by Trevor Grant
take
that
Post by Suneel Marthi
Post by Trevor Grant
to
Post by Andrew Musselman
mean
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
we
Post by Pat Ferrel
can do it in a way that makes the most sense to Mahout
users...
Post by Pat Ferrel
First steps are to include some methods in Mahout that
will
do
Post by Suneel Marthi
Post by Trevor Grant
that
Post by Andrew Musselman
preprocessing, and one that will turn something into a
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
tsv
string.
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
I have some general ideas on possible approached to
making
an
Post by Suneel Marthi
honest-mahout
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
interpreter but I want to play in the code and look at
the
Flink-Mahout
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
shell a bit before I try to organize my thoughts and
Post by Trevor Grant
present
them.
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
...(2) not sure what is the point of supporting
distributed
anything.
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
Post by Dmitriy Lyubimov
It
Post by Trevor Grant
Post by Trevor Grant
is
Post by Pat Ferrel
distributed presumably because it is hard to keep it in
memory.
Therefore,
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
plotting anything distributed potentially presents 2
storage
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
space and overplotting due to number of points. The
Post by Trevor Grant
idea
is
that
Post by Suneel Marthi
Post by Trevor Grant
we
Post by Andrew Musselman
have
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
to
Post by Pat Ferrel
work out algorithms that condense big data information
into
small
Post by Suneel Marthi
Post by Trevor Grant
plottable
Post by Andrew Musselman
Post by Dmitriy Lyubimov
Post by Trevor Grant
Post by Trevor Grant
Post by Pat Ferrel
information (like density grids, for example, or
histograms)....
Post by Pat Ferrel
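To make the "condense big data into small plottable information" idea a bit more concrete, here is a minimal sketch of equal-width histogram bucketing in plain Scala. It is only an illustration: the object and function names are hypothetical, nothing here is Mahout API, and a plain `Seq[Double]` stands in for one column of a matrix.

```scala
// Hypothetical sketch, not part of Mahout: a plain Seq[Double] stands in
// for one column of a (possibly sampled) matrix.
object HistogramSketch {
  // Count values into `bins` equal-width buckets spanning [min, max].
  // Returns (bucketStart, bucketEnd, count) triples, one per bucket.
  def histogram(values: Seq[Double], bins: Int): Seq[(Double, Double, Int)] = {
    require(bins > 0 && values.nonEmpty, "need at least one bin and one value")
    val lo = values.reduce(_ min _)
    val hi = values.reduce(_ max _)
    // If all values are equal, fall back to a width of 1.0 to avoid dividing by zero.
    val width = if (hi > lo) (hi - lo) / bins else 1.0
    val counts = new Array[Int](bins)
    values.foreach { v =>
      // The maximum value would index one past the end; clamp it into the last bucket.
      val idx = math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    (0 until bins).map(i => (lo + i * width, lo + (i + 1) * width, counts(i)))
  }
}
```

The output has one row per bucket no matter how many input points there are, which is what makes it small enough to hand to a plotting library. In a distributed setting the same counts could presumably be computed per partition and summed, but that part is not shown here.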
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I
just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
Eric Charles
2016-06-01 16:00:29 UTC
Permalink
Hi Suneel, an independent interpreter makes sense, as Mahout is supposed to
run on various backends, not only Spark.

Yes, I am following the Mahout mailing list (and not abroad this year - this
may change in the future).
Post by Suneel Marthi
Hi Eric,
We r talking about the same PR which is a tweak of existing Spark-Zeppelin
interpreter.
What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of above?
BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
Vancouver this time?
Post by Eric Charles
Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
https://github.com/apache/incubator-zeppelin/pull/928
It declares in the spark interpreter the mahout deps, and creates the sdc
(spark distributed context).
OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.
1- The blog post referenced the linear-regression example notebook twice-
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, a huge apology because that
notebook is where all of the 'magic happened', (all of the screen shots /
gg-plots / etc happened there).
https://github.com/rawkintrevo/incubator-zeppelin
if you build, and set 'spark.mahout' to 'true' in the Spark Interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it, I'll be opening a PR soon, we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.
In essence when 'spark-mahout' is 'true' you jump right in on r-like dsl
and you have a sdc declared based on the underlying sc.
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity. I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under Zeppelin project
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.
What do others have to say?
have a good holiday weekend,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Thx Trevor,
Re: m-1854, it was something that we started when we were first discussing
using the smile plots and trying to pipe them over to Zeppelin. As far as I
know there was no progress started on it. I've unassigned it.
Feel free to assign any Jiras to yourself. I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.
________________________________________
Sent: Saturday, May 28, 2016 11:21:44 PM
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
can I help?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great!
When you free up and have the time, could you create some Jiras for these?
We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras). I can close m-1854 and
m-1855 out and we can start new ones if they're not relevant anymore or we
can just go with those.
Thanks
________________________________________
Sent: Thursday, May 26, 2016 3:17:22 PM
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back spark interpreter (e.g.
exposed through something like %spark.mahout). So I have thoughts, but no
plan. Been busy with a couple of other commitments.
1) A function that will convert small matrices into TSV strings
2) Convenience functions for sampling super-large matrices into things like
histograms, etc, that one would want to plot. I.e. histogram bucketing?
3) (less important for the moment) an interpreter.
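As a rough illustration of the first item, here is a minimal sketch of a matrix-to-TSV helper in plain Scala. The name `toTsv` and the header argument are hypothetical and not part of Mahout; a plain `Array[Array[Double]]` stands in for a small in-core matrix.

```scala
// Hypothetical helper, not part of Mahout: a plain Array[Array[Double]]
// stands in for a small in-core matrix.
object TsvSketch {
  // Render a header row plus one tab-separated line per matrix row.
  def toTsv(header: Seq[String], rows: Array[Array[Double]]): String = {
    require(rows.forall(_.length == header.length),
      "every row must have as many values as the header has columns")
    val lines = header.mkString("\t") +: rows.toSeq.map(_.mkString("\t"))
    lines.mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    val m = Array(Array(1.0, 2.0), Array(3.0, 4.5))
    println(toTsv(Seq("x", "y"), m))
  }
}
```

A string in this shape is something R, pandas, or any other plotting front end can parse, which is why it only makes sense for matrices that have already been sampled or aggregated down to a plottable size.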
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
While on this subject, do we have a plan yet of integrating Zeppelin into
Mahout (or the converse) of having Mahout specific interpreter for
Zeppelin? I think that should be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
Trevor, I was reading over your blog last night again- first time since
you updated. It is great!
I have one suggestion, being adding in a code line on how the sampling is
done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
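For readers who have not seen row sampling without replacement before, here is a minimal, self-contained sketch of the general idea using reservoir sampling over a plain iterator. This is an assumption-laden illustration only: `sampleKRows` is a hypothetical name, and this is not claimed to be how Mahout's `drmSampleKRows` works internally.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical sketch, not Mahout code: reservoir sampling picks k rows
// uniformly at random, without replacement, in a single pass.
object SampleSketch {
  def sampleKRows[A](rows: Iterator[A], k: Int, rng: Random): Vector[A] = {
    require(k >= 0, "k must be non-negative")
    val reservoir = ArrayBuffer.empty[A]
    var seen = 0
    rows.foreach { row =>
      seen += 1
      if (reservoir.size < k) {
        // Fill the reservoir with the first k rows.
        reservoir += row
      } else {
        // Replace a random slot with probability k / seen.
        val j = rng.nextInt(seen)
        if (j < k) reservoir(j) = row
      }
    }
    reservoir.toVector
  }
}
```

If the input has fewer than k rows, all of them are returned. The point for plotting is the same as in the notebook: the sample has a fixed, small size regardless of how large the source matrix is.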
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re clone or do a "git fetch upstream" to get up to the
very latest
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
no parentheses.
import org.apache.mahout.sparkbindings._
....
val myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark, though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to
fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
take that to mean we can do it in a way that makes the most sense to Mahout
users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I
just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
Trevor Grant
2016-06-01 16:17:15 UTC
Permalink
Hey Eric,

The 'piggyback' or 'patch' approach is a lot easier and less invasive to
implement in practice, and has the Zeppelin community blessing.

When the Flink version comes online, it will also be super easy to
replicate the effort. And even doing two (or more) 'piggybacks' will be
easier to maintain than one stand-alone Mahout interpreter. Also,
'piggybacking' opens up the possibility of sharing between contexts,
minimizes user configuration, etc.

The differential is about 20 new lines of code for a piggyback on any
underlying engine, vs. about 300 lines of code for a stand-alone
interpreter which must be kept up to date with its Spark/Flink
counterparts.
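To show the shape of the piggyback idea, here is a minimal, self-contained sketch. Every name in it is a hypothetical stand-in (it does not use Zeppelin's or Mahout's real classes); the only detail taken from this thread is the 'spark.mahout' property and the idea that the sdc is derived from the underlying sc.

```scala
// All names here are hypothetical stand-ins; none of them are Zeppelin or
// Mahout APIs. The sketch only shows the shape of the "piggyback" approach.
object PiggybackSketch {
  // Stand-in for an engine context such as a SparkContext.
  final case class EngineContext(appName: String)

  // Stand-in for a Mahout distributed context built on the engine context.
  final case class DistributedContext(underlying: EngineContext)

  // Stand-in for an existing engine interpreter that already owns `sc`.
  class EngineInterpreter(props: Map[String, String]) {
    val sc: EngineContext = EngineContext(props.getOrElse("app.name", "zeppelin"))

    // The piggyback: only when the flag is set, derive the Mahout context
    // from the context the interpreter already has, instead of creating
    // and maintaining a second interpreter.
    val sdc: Option[DistributedContext] =
      if (props.get("spark.mahout").contains("true")) Some(DistributedContext(sc))
      else None
  }
}
```

Because the Mahout context wraps the interpreter's existing context rather than creating its own, anything registered in that context stays visible to the other language bindings sharing it, which is the "sharing between contexts" advantage mentioned above.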

Philosophically the stand-alone makes sense, practically the piggyback
does. *shruggie*

It is possible that somewhere down the road we'll refactor the
piggyback(s) into a stand-alone interpreter, at which point none of the
current effort will be wasted, it will just be moving some code around. So
the other advantage to the piggyback is that it quickly fields a minimum
viable product, without having to pay much for it later on down the road.

This is in part due to the way Zeppelin implemented its interpreters which
involves a lot of code repetition.

I'm open to further discussion, but after playing in the Zeppelin code for
a while and really grokking different approaches I think this one is best. I
do invite critiques because I believe I have considered most angles and can
properly defend the current path, and if there is something I haven't
thought of, I'd rather it be brought to light sooner rather than later.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Eric Charles
Hi Suneel, an independent interpreter makes sense as Mahout is supposed to
run on various backends, not only Spark.
Yes, I am following mahout mailing list (and not abroad this year - this
may change in the future).
Post by Suneel Marthi
Hi Eric,
We r talking about the same PR which is a tweak of existing Spark-Zeppelin
interpreter.
What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of above?
BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
Vancouver this time?
Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
Post by Eric Charles
https://github.com/apache/incubator-zeppelin/pull/928
It declares in the spark interpreter the mahout deps, and creates the sdc
(spark distributed context).
OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.
1- The blog post referenced the linear-regression example notebook twice-
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, a huge apology because that
notebook is where all of the 'magic happened' (all of the screenshots /
gg-plots / etc. happened there).
https://github.com/rawkintrevo/incubator-zeppelin
if you build, and set 'spark.mahout' to 'true' in the Spark interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it. I'll be opening a PR soon, we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.
In essence when 'spark.mahout' is 'true' you jump right into the R-like DSL
and you have an sdc declared based on the underlying sc.
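The piggyback idea described above can be pictured with a small sketch. This is a conceptual illustration in plain Python, not Zeppelin's code: the property name 'spark.mahout' comes from the message, while the function name `startup_lines` is hypothetical, and the Scala lines (the usual Mahout imports plus an implicit context built from `sc`) are an assumption about what such a patch would run, not the contents of the PR.

```python
# Conceptual sketch only: this is not Zeppelin's code.
# 'spark.mahout' is the property named in the thread; everything else
# (function name, exact start-up lines) is an illustrative assumption.

MAHOUT_STARTUP_LINES = [
    "import org.apache.mahout.math._",
    "import org.apache.mahout.math.scalabindings._",
    "import org.apache.mahout.math.drm._",
    "import org.apache.mahout.math.scalabindings.RLikeOps._",
    "import org.apache.mahout.math.drm.RLikeDrmOps._",
    "import org.apache.mahout.sparkbindings._",
    # Wrap the existing SparkContext (sc) in a Mahout distributed context.
    "implicit val sdc = sc2sdc(sc)",
]


def startup_lines(properties):
    """Return the extra Scala lines to run when the Mahout flag is switched on."""
    flag = str(properties.get("spark.mahout", "false")).strip().lower()
    return list(MAHOUT_STARTUP_LINES) if flag == "true" else []
```

The point of the sketch is the size of the change: the existing Spark interpreter is untouched unless the flag is set, and the extra work is a handful of lines run once at start-up.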
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity. I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under Zeppelin project
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.
What do others have to say?
have a good holiday weekend,
tg
Thx Trevor,
Re: m-1854, it was something that we started when we were first discussing
using the smile plots and trying to pipe them over to Zeppelin. As
far as I know no progress was made on it. I've unassigned it.
Feel free to assign any Jiras to yourself. I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.
________________________________________
Sent: Saturday, May 28, 2016 11:21:44 PM
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
can I help?
tg
Great!
When you free up and have the time, could you create some Jiras for these?
We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras). I can close m-1854 and
m-1855 out and we can start new ones if they're not relevant anymore, or we
can just go with those.
Thanks
________________________________________
Sent: Thursday, May 26, 2016 3:17:22 PM
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back spark interpreter (e.g.
exposed through something like %spark.mahout). So I have thoughts, but no
plan. Been busy with a couple of other commitments.
- A function that will convert small matrices into TSV strings
- Convenience functions for sampling super-large matrices into things like
  histograms, etc, that one would want to plot. I.e. histogram bucketing?
- (less important for the moment) an interpreter.
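The first two items can be sketched in a few lines. The following is a plain-Python illustration of the idea only, not Mahout's API; the names `matrix_to_tsv` and `histogram_counts` are hypothetical, and a small in-memory list of rows stands in for a matrix.

```python
def matrix_to_tsv(rows, header=None):
    """Render a small in-memory matrix (list of rows) as a TSV string."""
    lines = []
    if header is not None:
        lines.append("\t".join(header))
    for row in rows:
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines)


def histogram_counts(values, num_buckets, lo, hi):
    """Count values into num_buckets equal-width buckets covering [lo, hi]."""
    if num_buckets <= 0 or hi <= lo:
        raise ValueError("need num_buckets > 0 and hi > lo")
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for value in values:
        if value < lo or value > hi:
            continue  # ignore values outside the plotted range
        # The top edge (value == hi) falls into the last bucket.
        index = min(int((value - lo) / width), num_buckets - 1)
        counts[index] += 1
    return counts
```

Bucketing first and then emitting the bucket counts as TSV keeps the string small no matter how many rows the original matrix has, which is the reason for doing the reduction before plotting.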
While on this subject, do we have a plan yet of integrating Zeppelin into
Mahout (or the converse) of having Mahout specific interpreter for
Zeppelin? I think that should be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
Trevor, I was reading over your blog last night again- first time since
you updated. It is great!
I have one suggestion: adding in a code line on how the sampling is done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
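For readers who have not seen row sampling before, here is a minimal plain-Python sketch of what drawing k rows from a matrix means. It only mirrors the argument order of the `drmSampleKRows` call quoted above; the function name `sample_k_rows` and the `seed` parameter are hypothetical, and nothing here reflects how Mahout implements sampling on a distributed matrix.

```python
import random


def sample_k_rows(rows, k, replacement=False, seed=None):
    """Draw k rows from an in-memory list of rows.

    Without replacement, at most len(rows) rows are returned.
    """
    rng = random.Random(seed)
    if not rows or k <= 0:
        return []
    if replacement:
        # Each draw is independent, so the same row may appear twice.
        return [rng.choice(rows) for _ in range(k)]
    return rng.sample(rows, min(k, len(rows)))
```

Sampling down to roughly a thousand rows, as in the quoted line, gives a result small enough to collect and hand to a plotting library.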
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
no parentheses.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer. In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context, so you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
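On the consuming side, a TSV string is easy to turn back into plottable columns. The sketch below uses only the Python standard library; the function name `tsv_to_columns` is hypothetical, and it assumes a header row followed by purely numeric cells. How the string reaches the Python paragraph (for example through Zeppelin's resource pool) is outside the sketch.

```python
import csv
import io


def tsv_to_columns(tsv_text):
    """Parse a TSV string with a header row into {column name: [floats]}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        return {}
    columns = {name: [] for name in header}
    for row in reader:
        if not row:
            continue  # skip blank lines
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns
```

The resulting dictionary of lists can be passed straight to a plotting call that takes x and y sequences, or wrapped in a pandas DataFrame if pandas is available.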
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to fit
in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
[...] take that to mean we can do it in a way that makes the most sense to
Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
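A density grid is one concrete way to condense many points into something small enough to plot. The sketch below is a single-machine, plain-Python illustration with a hypothetical function name (`density_grid`); a distributed version would compute the same per-cell counts on each partition and add them up, which is an assumption about how one might do it rather than anything stated in the thread.

```python
from collections import Counter


def density_grid(points, x_range, y_range, nx, ny):
    """Count (x, y) points into an ny-by-nx grid of equal-sized cells.

    Returns a list of ny rows, each holding nx counts. Points outside the
    given ranges are ignored.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    if nx <= 0 or ny <= 0 or x_hi <= x_lo or y_hi <= y_lo:
        raise ValueError("need positive grid size and non-empty ranges")
    counts = Counter()
    for x, y in points:
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi):
            continue
        # Points on the upper edge are placed in the last cell.
        col = min(int((x - x_lo) / (x_hi - x_lo) * nx), nx - 1)
        row = min(int((y - y_lo) / (y_hi - y_lo) * ny), ny - 1)
        counts[(row, col)] += 1
    return [[counts[(r, c)] for c in range(nx)] for r in range(ny)]
```

However many points go in, the output has exactly nx * ny numbers, so it sidesteps both problems named above: it fits in memory and it cannot overplot.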
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Eric Charles
2016-06-01 16:28:06 UTC
Permalink
+1 piggybacking sounds reasonable and quick-win.
Post by Trevor Grant
Hey Eric,
The 'piggyback' or 'patch' approach is a lot easier and less invasive to
implement in practice, and has the Zeppelin community blessing.
When the Flink version comes online, it will also be super easy to
replicate the effort. And even doing two (or more) 'piggybacks' will be
easier to maintain than one stand-alone Mahout interpreter. Also,
'piggybacking' opens up the possibility of sharing between contexts,
minimizes user configuration, etc.
The differential is about 20 new lines of code for a piggyback on any
underlying engine, vs. about 300 lines of code for a stand-alone
interpreter which must be kept up to date with its Spark/Flink
counterparts.
Philosophically the stand-alone makes sense, practically the piggyback
does. *shruggie*
It is possible that somewhere down the road we'll refactor the
piggyback(s) into a stand-alone interpreter, at which point none of the current
effort will be wasted, it will just be moving some code around. So the
other advantage to the piggyback is that it quickly fields a minimum viable
product, without having to pay much for it later on down the road.
This is in part due to the way Zeppelin implemented its interpreters which
involves a lot of code repetition.
I'm open to further discussion, but after playing in the Zeppelin code for
a while and really grokking different approaches I think this one is best. I
do invite critiques because I believe I have considered most angles and can
properly defend the current path, and if there is something I haven't
thought of, I'd rather it be brought to light sooner rather than later.
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Eric Charles
Hi Suneel, an independent makes sense as mahout is supposed to run on
various backend, so not only spark.
Yes, I am following mahout mailing list (and not abroad this year - this
may change in the future).
Post by Suneel Marthi
Hi Eric,
We r talking about the same PR which is a tweak of existing Spark-Zeppelin
interpreter.
What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of above?
BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
Vancouver this time?
Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
Post by Eric Charles
https://github.com/apache/incubator-zeppelin/pull/928
It declares in the spark interpreter the mahout deps, and creates the sdc
(spark distributed context).
OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.
1- The blog post referenced the linear-regression example notebook twice-
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, I huge appology because that
notebook is where all of the 'magic happened', (all of the screen shots /
gg-plots / etc happened there).
https://github.com/rawkintrevo/incubator-zeppelin
if you build, and set 'spark.mahout' to 'true' in the Spark Interpretter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it, I'll be opening a PR soon, we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.
In essence when 'spark-mahout' is 'true' you jump right in on r-like dsl
and you have a sdc declared based on the underlying sc.
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity. I would prefer
having a separate MAhout-Spark-Zeppelin interpreter under Zeppelin project
if that's acceptable to the Zeppelin folks, even though most of it might be
repeatee.
What do others have to say?
have a good holiday weekend,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Thx Trevor,
Re: m-1854, It was something that we started when were first discussing
using the smile plots for and trying to pipe them over to Zeppelin ..
As
far as I know there was not progress started on it.. I've unassigned it.
Feel free to Assign any Jiras to yourself. I think that m-1854 is
similar
to the mahout-spark-shell, so I may be able to help out there.
________________________________________
Sent: Saturday, May 28, 2016 11:21:44 PM
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date?
How
can I help?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Great!
When you free up and have the time, could you create some Jiras for
these?
We actually have MAHOUT-1852 open for Histograms already, and
MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras). I can close m-1854
and
m-1855 out and we can start new ones if they're not relevant anymore
or
we
can just go with those.
Thanks
________________________________________
Sent: Thursday, May 26, 2016 3:17:22 PM
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout
interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in
the
short term, I think it should be a piggy-back spark interpreter (e.g.
exposed through something like %spark.mahout). So I have thoughts,
but
no
plan. Been busy with a couple of other commitments.
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things
like
histograms, etc, that one would want to plot. I.e. histogram bucketing?
(less important for the moment)
an interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
While on this subject, do we have a plan yet of integrating Zeppelin
Post by Suneel Marthi
into
Mahout (or the converse) of having Mahout specific interpreter for
Post by Suneel Marthi
Zeppelin? I think that shuld be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Post by Trevor Grant
Yea that seems like a good add. If not this afternoon, I'll include
it
Saturday.
Post by Suneel Marthi
Post by Trevor Grant
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
Trevor, I was reading over your blog last night again- first time
since
you updated. It is great!
Post by Suneel Marthi
Post by Trevor Grant
I have one suggestion being adding in a code line on how the the
sampling
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
Post by Suneel Marthi
Post by Trevor Grant
mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a
version
is
Post by Suneel Marthi
Post by Trevor Grant
uninformative to me. I'd say if possible, you're first
troubleshooting
measure would be to re clone or do a "git fetch upstream" to get
Post by Suneel Marthi
Post by Trevor Grant
up
to
the
Post by Suneel Marthi
Post by Trevor Grant
very latest
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the
dependencies;
is
Post by Suneel Marthi
Post by Trevor Grant
that a feature in more modern zep?
Post by Andrew Musselman
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
no parenthesis.
Post by Dmitriy Lyubimov
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Post by Trevor Grant
Hey Pat,
Post by Trevor Grant
If you spit out a TSV - you can import into pyspark /
matplotlib
from
Post by Trevor Grant
the
Post by Andrew Musselman
Post by Dmitriy Lyubimov
resource pool in essentially the same way and use that
Post by Trevor Grant
Post by Trevor Grant
plotting
library
Post by Suneel Marthi
Post by Trevor Grant
Post by Andrew Musselman
if
Post by Dmitriy Lyubimov
Post by Trevor Grant
you prefer. In fact you could import the tsv into pandas
Post by Trevor Grant
and
use
Post by Suneel Marthi
all
Post by Trevor Grant
of
Post by Andrew Musselman
Post by Dmitriy Lyubimov
the pandas plotting as well (though I think it is for the
Post by Trevor Grant
Post by Trevor Grant
most
part,
Post by Suneel Marthi
Post by Trevor Grant
also
Post by Andrew Musselman
Post by Dmitriy Lyubimov
matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context you can create RDDs in one
language and access them / work on them in another (so I understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
Post by Trevor Grant
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
Post by Pat Ferrel
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
Post by Pat Ferrel
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to fit
in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
take that to mean we can do it in a way that makes the most sense to Mahout
users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Post by Pat Ferrel
Post by Pat Ferrel
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Suneel Marthi
2016-05-20 18:46:28 UTC
Permalink
Post by Trevor Grant
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to fit
in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
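A minimal sketch of the "turn something into a tsv string" step, written in plain Python rather than against any Mahout API. The function name `matrix_to_tsv` and its parameters are hypothetical and only illustrate the idea of capping the row count by sampling before serializing; an in-Mahout version would presumably operate on a DRM instead of a list of rows.

```python
import random


def matrix_to_tsv(rows, header=None, max_rows=1000, seed=0):
    """Serialize rows to a TSV string, sampling down to at most max_rows rows.

    `rows` is any iterable of row sequences; this is an illustrative stand-in
    for a matrix, not a Mahout type.
    """
    rows = list(rows)
    if len(rows) > max_rows:
        rng = random.Random(seed)
        # Sample row indices without replacement and keep the original order.
        keep = sorted(rng.sample(range(len(rows)), max_rows))
        rows = [rows[i] for i in keep]
    lines = []
    if header is not None:
        lines.append("\t".join(header))
    for row in rows:
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines) + "\n"
```

Keeping the sampling inside the serializer is one way to get the "sanity" mentioned elsewhere in the thread: the caller cannot accidentally emit a string with more rows than a plot can use.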
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the
Flink Shell is still busted on their end and we on the Mahout end have not
had time to muck with it. What exists today is the Mahout-Spark shell.
Post by Trevor Grant
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
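To make the density-grid idea concrete, here is a small single-machine sketch in plain Python. It is only an illustration of the condensing step: `density_grid` and its `bins` parameter are hypothetical names, and a distributed version would compute the per-cell counts as a reduction over partitions rather than in one loop.

```python
def density_grid(points, bins=4):
    """Condense (x, y) points into a bins x bins grid of counts.

    The output size depends only on `bins`, not on the number of points,
    which is what makes it plottable however large the input is.
    """
    points = list(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def bucket(value, low, high):
        if high == low:
            return 0
        # Clamp so the maximum value falls into the last cell.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        grid[bucket(y, y_min, y_max)][bucket(x, x_min, x_max)] += 1
    return grid
```

For example, `density_grid([(0, 0), (1, 1), (1, 1), (0, 1)], bins=2)` yields `[[1, 0], [1, 2]]`, with rows indexed by y and columns by x.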
Agreed, something like sampling x% of points from a DRM (like the visuals I
had from Palumbo for the talk in Vancouver that demonstrated the concept)
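The "sample x% of points" approach can be sketched as independent per-row coin flips. This is a plain Python illustration under the assumption that rows can be iterated once; `sample_fraction` is a hypothetical helper, not a Mahout function.

```python
import random


def sample_fraction(rows, fraction, seed=42):
    """Keep each row independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    # One pass, constant memory per row, original order preserved.
    return [row for row in rows if rng.random() < fraction]
```

Because each row is decided independently, the same logic can run per partition without coordination; the trade-off is that the output size is only approximately `fraction` times the input size.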
Post by Trevor Grant
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got side tracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
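Step 4 on the Python side could look roughly like the sketch below. It assumes the TSV string has already been fetched (from the resource pool or from disk; that fetch is deliberately not shown) and that the first line is a header with purely numeric columns. The name `tsv_to_columns` is hypothetical.

```python
import csv
import io


def tsv_to_columns(tsv_text):
    """Parse a headered, all-numeric TSV string into {column name: [floats]}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    columns = {name: [] for name in header}
    for record in reader:
        if not record:
            continue  # tolerate blank lines, e.g. a trailing newline
        for name, value in zip(header, record):
            columns[name].append(float(value))
    return columns
```

A column-oriented dictionary is convenient here because most plotting calls take one sequence per axis, so something like `columns["x"]` and `columns["y"]` can be handed to a plotting library directly.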
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector of drm, something registered with
Kryo.
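For anyone running outside Zeppelin, the same four Kryo settings quoted in this thread can be passed on the command line. The sketch below only builds the `--conf key=value` argument list that `spark-submit` accepts; the helper name `to_conf_args` is hypothetical, and in Zeppelin the same keys would instead be entered as Spark interpreter properties.

```python
# The Kryo-related Spark properties mentioned in this thread.
KRYO_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
}


def to_conf_args(settings):
    """Turn a {property: value} dict into spark-submit style --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

Keeping the properties in one dictionary makes it easy to reuse them for both the command line and any programmatic configuration.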
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Lower I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will work to add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up, I'll make a blog
post but it's almost a text book entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark pass it to AngularJS then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Pat Ferrel
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
  sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
at this point are really just a POC, but at some point we may want to
integrate some data transformation features into the mahout plots classes
so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Post by Pat Ferrel
Post by Pat Ferrel
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Pat Ferrel
Post by Pat Ferrel
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so because this isn't public): this integration
is super easy from a user perspective, almost too easy, e.g. why not let the
user add it themselves... Add the appropriate Maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The start of this declaration was lost in the archive; the name `sdc` is assumed.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question: is this even worth building a full
interpreter for, or should we just make a really nice blog post with examples
on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount
of work on our Mahout Spark shell, so let me know if you have any questions
about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat
Andrew
Post by Pat Ferrel
Post by Pat Ferrel
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code
to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download
now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin
interpreter for Apache Mahout, one could leverage all of the Zeppelin
visualizations, anything in AngularJS, or anything in R (through clever use
of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of
the project in a new body, which is less a collection of algorithms than a
roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML data
(not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would love
some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting you last week in Vancouver. Per our conversation, I wanted to
introduce you to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for Spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we are revamping the Mahout project
and shedding its legacy MapReduce image.
I sent you an invite to the Mahout slack channel, mahout.apache.org
<http://mahout.apache.org/> - that's where we all hang out without having
to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-20 19:35:41 UTC
Permalink
FYI:

Looks like Flink shell is fixed :D

https://github.com/apache/flink/pull/1913

(I tested it; it's working well.)



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Dmitriy really hit the nail on the head in his reply to the post, which I'll
rebroadcast below. In essence, the whole reason you are (theoretically)
using Mahout is that the data is too big to fit in memory. If it's too big to
fit in memory, then it's probably too big to plot each point (e.g. with
trillions of rows, you only have so many pixels). For the example I randomly
sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin plotting thing, we need to have a function that will spit
out a TSV-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a TSV string.
I have some general ideas on possible approaches to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the
Flink Shell is still busted on their end and we on the Mahout end have not
had time to muck with it. What exists today is the Mahout-Spark shell.
...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
Agreed, something like sampling x% of points from a DRM (like the visuals I
had from Palumbo for the talk in Vancouver that demonstrated the concept)
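The two ideas above (sample a fraction of the points, or condense them into a histogram) can be sketched in plain Scala. This is only an illustration on local, in-memory collections: the object and method names are made up for this example, and doing the same on a DRM would need Mahout's distributed operators, which are not shown here.

```scala
import scala.util.Random

// Hypothetical helper, not part of Mahout: works on local data only.
object PlotPrep {
  /** Keep each row independently with probability `fraction`. */
  def sampleRows(rows: Seq[Array[Double]], fraction: Double, seed: Long): Seq[Array[Double]] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  /** Count values into `bins` equal-width buckets spanning [min, max]. */
  def histogram(values: Seq[Double], bins: Int): Array[Int] = {
    require(bins > 0 && values.nonEmpty, "need at least one bin and one value")
    val lo = values.min
    val hi = values.max
    // Avoid a zero bucket width when all values are identical.
    val width = if (hi > lo) (hi - lo) / bins else 1.0
    val counts = new Array[Int](bins)
    values.foreach { v =>
      // The maximum value would land one past the end, so clamp it into the last bucket.
      val idx = math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    counts
  }
}
```

Either result is small enough to plot no matter how many points went in, which is the point Dmitriy makes about overplotting.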
Post by Trevor Grant
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Trevor Grant
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
Post by Trevor Grant
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with SparkR
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, e.g. a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a data science tool for non-technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
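Step 3 (turning a small, already-sampled table into a tsv string) could look roughly like the sketch below. It is plain Scala on an in-memory table; `TsvExport` and `toTsv` are names invented for this illustration, not Mahout or Zeppelin API, and how the string is then handed to the resource pool is left out because the thread does not spell out that call.

```scala
// Hypothetical helper, not part of Mahout or Zeppelin.
object TsvExport {
  /** Render a header and numeric rows as tab-separated text, one row per line. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    require(rows.forall(_.length == header.length), "row width must match header")
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

A string in this shape is easy to read back on the R side into a data frame, which is all step 4 needs before plotting.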
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"org.apache.spark.serializer.KryoSerializer",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
Mahout-specific data type like a vector or DRM, something registered with
Kryo.
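For anyone using Mahout as a library from their own Spark code, the settings quoted in this thread could be applied through a `SparkConf` along these lines. This is a sketch, not the way the Mahout shell does it: the app name is a placeholder, and the pairing of `spark.serializer` and `spark.kryo.registrator` with their values is taken from Trevor's property list elsewhere in the thread. It assumes Spark and the Mahout Spark bindings are on the classpath.

```scala
import org.apache.spark.SparkConf

// Property names and values are the ones quoted in this thread;
// "mahout-kryo-sketch" is a placeholder app name.
val conf = new SparkConf()
  .setAppName("mahout-kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer", "300m")
```

In Zeppelin the same keys would instead go in as Spark interpreter properties, which is the route Trevor describes.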
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Below I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will work to add some syntactic sugar to
prepare the matrix from the Scala side and pass it to Zeppelin (using
resource pools so the same functionality can be reused in Flink) and an R
library containing some functions which will pull the data out of the
resource pool and spit out a dataframe.
Once it's in a dataframe in R, go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up. I'll make a blog
post, but it's almost a textbook entry tutorial on using imports in
Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make Zeppelin recognize matrices as tables and expose all of the
built-in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need
to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Trevor Grant
Post by Pat Ferrel
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The start of this declaration was lost in the archive; the name `sdc` is assumed.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Pat Ferrel
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is definitely the priority. The Mahout
plots are at this point really just a POC, but at some point we may want to
integrate some data transformation features into the Mahout plot classes, so
they're really more future work.
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue
about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new
thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced :) I am a bit skeptical Smile would
have something that ggplot2 would not; the other way around is much more
expected by me :)
Anyhow, if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all-around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
Sorry, answering a question from a couple of emails back on the thread.
If possible, I think it would be great to eventually have both (native
Mahout/Smile plots and ggplot), since in the future we're going to be
adding more visualization features, beyond simple scatter plots etc., that
may not be covered by ggplot.
That's why we were thinking about using angular and the PNGs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue
about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
Sorry for the double email, but are you thinking visualization should be a
library internal to Mahout or should we leverage Zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry, to be a little more clear: part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
Most of the hard work was done by Dmitriy; I've just reworked it a couple of
times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots are
not working properly. The plots are Swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Post by Trevor Grant
Post by Pat Ferrel
Post by Pat Ferrel
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib
since it is pretty rich in ML support. We are looking at integrating some of
that with Zeppelin then adding code to feed the new visualizations in Mahout.
I’m here because I’m fairly familiar with AngularJS if that’s the way to go.
Smile is Swing based but can output PNGs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
Post by Trevor Grant
Post by Pat Ferrel
Post by Pat Ferrel
my java is sketchy at best, i tend to over import. I pulled in
the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Post by Trevor Grant
Post by Pat Ferrel
Post by Pat Ferrel
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so because this isn't public): this integration
is super easy from a user perspective, almost too easy, e.g. why not let the
user add it themselves... Add the appropriate Maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The start of this declaration was lost in the archive; the name `sdc` is assumed.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question: is this even worth building a full
interpreter for, or should we just make a really nice blog post with examples
on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount
of work on our Mahout Spark shell, so let me know if you have any questions
about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
,
Pat
Andrew
Post by Pat Ferrel
Post by Pat Ferrel
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code
to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download
now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin
interpreter for Apache Mahout, one could leverage all of the Zeppelin
visualizations, anything in AngularJS, or anything in R (through clever use
of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYI...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org<http://mahout.apache.org/> - that's where we all hangout and not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 21:23:52 UTC
Permalink
At this step of the tutorial I'm stuck because I don't have an "Import
Note" link in my Zeppelin home:

"I’m going to do you another favor. Go to the Zeppelin home page and click
on ‘Import Note’. When given the option between URL and json, click on URL
and enter the following link:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"
Post by Trevor Grant
Looks like Flink shell is fixed :D
https://github.com/apache/flink/pull/1913
(I tested, is working good).
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Dmitriy really nailed it on the head in his reply to the post which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit out a tsv-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on [...] mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
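To make those two first steps concrete, here is a minimal sketch in plain Scala. None of this is Mahout API: `TsvSketch`, `sampleRows` and `toTsv` are hypothetical names, and a `Seq[Array[Double]]` stands in for rows that have already been pulled out of a matrix. It only shows the idea of sampling rows down to a plottable size and rendering them as a tsv string.

```scala
import scala.util.Random

// Hypothetical helper, not part of Mahout: rows are plain arrays of doubles.
object TsvSketch {
  // Keep each row with probability `fraction`; a fixed seed makes the sample repeatable.
  def sampleRows(rows: Seq[Array[Double]], fraction: Double, seed: Long): Seq[Array[Double]] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  // Render a header line plus one tab-separated line per row.
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin paragraph, printing such a string after a `%table` prefix is one way to hand it to the built-in charts; passing it to R would go through whatever hand-off mechanism ends up being chosen.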
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the Flink Shell is still busted on their end and we on the Mahout end have not had time to muck with it. What exists today is the Mahout-Spark shell.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, plotting anything distributed potentially presents 2 problems: storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
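A density grid of the kind mentioned above can be sketched in a few lines of plain Scala. This is an illustration only, not anything that exists in Mahout: `DensityGridSketch` and `densityGrid` are hypothetical names, and the points are assumed to be an in-memory sequence of (x, y) pairs with known bounds.

```scala
// Hypothetical helper, not part of Mahout.
object DensityGridSketch {
  // Count points per cell of a bins x bins grid over [xMin, xMax] x [yMin, yMax].
  def densityGrid(points: Seq[(Double, Double)], bins: Int,
                  xMin: Double, xMax: Double,
                  yMin: Double, yMax: Double): Array[Array[Int]] = {
    val grid = Array.fill(bins, bins)(0)
    // Map a coordinate to a cell index, clamping so that values on or
    // outside the bounds land in the first or last cell.
    def cell(v: Double, lo: Double, hi: Double): Int = {
      val i = ((v - lo) / (hi - lo) * bins).toInt
      math.min(math.max(i, 0), bins - 1)
    }
    for ((x, y) <- points) grid(cell(x, xMin, xMax))(cell(y, yMin, yMax)) += 1
    grid
  }
}
```

However many points go in, the output is always bins x bins counts, which is what makes it plottable. Grids computed on separate chunks of data can be combined by adding them cell by cell, so the same idea should carry over to distributed data.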
Agreed, something like sampling x% of points from a DRM (like the visuals I had from Palumbo for the talk in Vancouver that demonstrated the concept)
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
  - If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
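For the "disk if you didn't have zeppelin" variant of steps 3 and 4, the hand-off is just a file round trip. A rough sketch in plain Scala, with the reading side also written in Scala only to keep the example in one language (in the pipeline above the reader would be R or Python). `TsvFileSketch`, `writeTsv` and `readTsv` are hypothetical names, not Mahout or Zeppelin API, and the tsv is assumed to have one header line followed by numeric rows.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Mahout or Zeppelin.
object TsvFileSketch {
  // Step 3 (disk variant): write the tsv string to a file.
  def writeTsv(path: Path, tsv: String): Unit =
    Files.write(path, tsv.getBytes(StandardCharsets.UTF_8))

  // Step 4 (disk variant): read it back as a header plus numeric rows.
  def readTsv(path: Path): (Seq[String], Seq[Seq[Double]]) = {
    val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    val lines = text.split("\n").toSeq.filter(_.nonEmpty)
    val header = lines.head.split("\t").toSeq
    val rows = lines.tail.map(_.split("\t").toSeq.map(_.toDouble))
    (header, rows)
  }
}
```

The resource pool version would replace the file with a named entry in the pool, but the shape of the data being passed stays the same.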
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots are at this point really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are swing based and can currently be
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is swing based but can output pngs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- eg why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more
'visualization
Post by Pat Ferrel
Post by Pat Ferrel
friendly'? I could pass the results to Angular or R just to
show
Post by Trevor Grant
Post by Trevor Grant
off
how
Post by Pat Ferrel
to
Post by Pat Ferrel
do it.
Which leads back to the question, is this even worth building a
full
Post by Pat Ferrel
Post by Pat Ferrel
interpreter for or just make a really nice blog post with
examples
Post by Trevor Grant
on
how
Post by Pat Ferrel
Post by Pat Ferrel
to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've
done
Post by Trevor Grant
a
Post by Trevor Grant
good
Post by Pat Ferrel
Post by Pat Ferrel
amount of work on our mahout spark shell .. so let me know if
you
Post by Trevor Grant
Post by Trevor Grant
have
Post by Pat Ferrel
any
Post by Pat Ferrel
questions there about what we did there..
Thanks alot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
,
Pat
Andrew
Post by Pat Ferrel
Post by Pat Ferrel
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The
old
Post by Trevor Grant
lost
Post by Pat Ferrel
Post by Pat Ferrel
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting
ready
Post by Trevor Grant
Post by Trevor Grant
to
Post by Pat Ferrel
Post by Pat Ferrel
download now and start playing around.
I had talked about this briefly but it given a properly
functioning
Post by Trevor Grant
Post by Pat Ferrel
Post by Pat Ferrel
Zeppelin interpreter for Apache Mahout, one could leverage all
of
Post by Trevor Grant
Post by Trevor Grant
the
Post by Pat Ferrel
Post by Pat Ferrel
Zeppelin visualizations, anything in AngularJS, or anything in
R
Post by Trevor Grant
Post by Trevor Grant
(through
Post by Pat Ferrel
Post by Pat Ferrel
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org<http://trevorgrant.org/>
"Fortunate is he, who is able to know the causes of things."
-Virgil
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout
Samsara.
Post by Pat Ferrel
Post by Pat Ferrel
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms than
a roll-your-own math/algorithm tool. The major benefit is that during
experimentation and later in production the code is by nature scalable on
Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell,
which has been integrated with an old version of Zeppelin (code is lost).
Recently Andy has experimented with some very nice visualizations of ML
data (not just analytics data). We as a project are interested in Zeppelin
integration of our shell and graphics. From what I understand the graphics
extension mechanism of Zeppelin is based on AngularJS, which I have some
experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration
with Mahout (primarily for spark) and would appreciate your help (as also
all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel,
mahout.apache.org<http://mahout.apache.org/> - that's where we all hangout
and not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-20 21:28:10 UTC
Permalink
That's a "new" feature in the 0.6-snapshot... Say within the last month or
two, how long has it been since you did a git pull?

I'll update soon with a note on that.

I can also create a gist with the code.
Post by Andrew Musselman
At this step of the tutorial I'm stuck because I don't have an "Import
"I’m going to do you another favor. Go to the Zeppelin home page and click
on ‘Import Note’. When given the option between URL and json, click on URL
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"
Post by Trevor Grant
Looks like Flink shell is fixed :D
https://github.com/apache/flink/pull/1913
(I tested, is working good).
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
Post by Trevor Grant
Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is that the data is too big to fit in memory. If it's too big
to fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion
on
mean
we
Post by Trevor Grant
can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
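(A minimal sketch of what those two helpers could look like, not anything that exists in Mahout today. It assumes the data has already been collected into plain in-memory rows; `MatrixTsv`, `sampleRows`, and `toTsv` are hypothetical names used only for illustration.)

```scala
import scala.util.Random

object MatrixTsv {
  /** Keep roughly `fraction` of the rows, chosen at random.
    * A fixed seed makes the sample repeatable between runs. */
  def sampleRows(rows: Seq[Array[Double]], fraction: Double, seed: Long = 42L): Seq[Array[Double]] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  /** Render a header line plus one tab-separated line per row. */
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

Sampling first and stringifying second keeps the string small no matter how large the source matrix is, which is the whole point of the preprocessing step.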
I have some general ideas on possible approaches to making an
honest-mahout interpreter but I want to play in the code and look at the
Flink-Mahout shell a bit before I try to organize my thoughts and present
them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the
Flink Shell is still busted on their end and we on the Mahout end have not
had time to muck with it. What exists today is the Mahout-Spark shell.
Post by Trevor Grant
...(2) not sure what is the point of supporting distributed anything.
It
is
Post by Trevor Grant
distributed presumably because it is hard to keep it in memory.
Therefore,
storage
Post by Trevor Grant
Post by Trevor Grant
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
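(To make the density-grid idea concrete, here is a rough sketch under simple assumptions: 2-D points already in memory and a fixed bounding box. `DensityGrid` and `grid2d` are hypothetical names, not Mahout API. The output is `bins * bins` counts regardless of how many points went in.)

```scala
object DensityGrid {
  /** Count points per cell on a bins x bins grid over
    * [minX, maxX] x [minY, maxY]. Points outside the box are
    * clamped into the nearest edge cell. */
  def grid2d(points: Seq[(Double, Double)], bins: Int,
             minX: Double, maxX: Double,
             minY: Double, maxY: Double): Array[Array[Int]] = {
    require(bins > 0 && maxX > minX && maxY > minY, "need bins > 0 and a non-empty box")
    val counts = Array.ofDim[Int](bins, bins)
    // Map a coordinate to its cell index, clamped to [0, bins - 1].
    def cell(v: Double, lo: Double, hi: Double): Int = {
      val i = ((v - lo) / (hi - lo) * bins).toInt
      math.min(math.max(i, 0), bins - 1)
    }
    for ((x, y) <- points) counts(cell(x, minX, maxX))(cell(y, minY, maxY)) += 1
    counts
  }
}
```

A distributed version would compute the same per-cell counts on each partition and add the grids together, since the counts are simply additive.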
Agreed, something like sampling x% of points from a DRM (like the visuals I
had from Palumbo for the talk in Vancouver that demonstrated the concept)
Trevor Grant
Great job Trevor, we’ll need this detail to smooth out the sharp edges and
any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however
without further ado, I present two notebooks that integrate Mahout + Spark
+ Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
- This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
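(For the "to a disk if you didn't have zeppelin" variant of steps 3 and 4, a minimal sketch using only the JDK file API. `TsvDisk` is a hypothetical helper name; the read side here is Scala purely to show the round trip, whereas in the pipeline above R or Python would read the file.)

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object TsvDisk {
  /** Step 3 without Zeppelin: dump the tsv string to a file. */
  def write(path: Path, tsv: String): Unit =
    Files.write(path, tsv.getBytes(StandardCharsets.UTF_8))

  /** Step 4 without Zeppelin: read the header and the numeric rows back.
    * Assumes the file is non-empty and its first line is the header. */
  def read(path: Path): (Seq[String], Seq[Seq[Double]]) = {
    val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    val lines = text.split("\n").toSeq.filter(_.nonEmpty)
    val header = lines.head.split("\t").toSeq
    val rows = lines.tail.map(line => line.split("\t").toSeq.map(_.toDouble))
    (header, rows)
  }
}
```

With Zeppelin in the picture the file is simply replaced by the resource pool; the tsv string itself stays the same.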
Trevor Grant
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is
a little convoluted (so maybe no need for Angular integration). To get
graphics out of Mahout it would be nice to not require knowledge of R
and/or python. Knowing Mahout is already bad enough but I guess the API
from the Mahout side for plotting could be Scala syntactic sugar. What and
how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a
mahout specific data type like vector or drm, something registered with
Kryo.
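(A small sketch for keeping these settings in one place and checking that nothing is missing. The property names and values are the ones quoted in this thread; `MahoutKryoConf`, `asSubmitArgs`, and `missing` are hypothetical helper names, and the `--conf key=value` form is the usual spark-submit syntax.)

```scala
object MahoutKryoConf {
  // Property names and values as quoted in this thread.
  val props: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Render the properties as spark-submit style arguments. */
  def asSubmitArgs(ps: Seq[(String, String)] = props): Seq[String] =
    ps.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  /** Keys from `required` that do not appear in `ps`. */
  def missing(ps: Seq[(String, String)], required: Seq[String]): Seq[String] =
    required.filterNot(r => ps.exists(_._1 == r))
}
```

In Zeppelin the same key/value pairs would be entered as properties on the Spark interpreter rather than passed on a command line.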
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things
were 'working' before because I wasn't actually integrating. Lower I will
outline the imports/properties, please look over and tell me if I'm
theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily
unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map
using ggplot
Once I have a working prototype I will add some syntactic sugar to
prepare the matrix from the scala side and pass to zeppelin (using resource
pools so the same functionality can be reused in Flink) and an R library
containing some functions which will pull the data out of the resource pool
and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and
python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source
code, and isn't very intrusive or difficult to set up, I'll make a blog
post but it's almost a textbook entry tutorial on using imports in
Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as
it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a
little more hairy in that case, but we could make an optional build profile
that would make zeppelin recognize matrices as tables and expose all of the
built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be
somewhat tedious), you're going to end up with a lot of examples where you
create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS
code charts it for you. At that point however, you're doing just as much
work, if not more, than it would be to simply pass to R or Python and let
ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is
what makes me fear I'm not doing this right... it was too easy...) If
anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) as the implicit Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
  sc2sdc(sc)
```
Trevor Grant
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or
serializer factories IIRC. Also there is the Spark conf for allocating
memory for the Kryo buffer. Look at the code in the mc creation code in the
Spark package helpers. All can be done in straight Spark and passed in to
create the mc when needed. Again from old weak brain cells but I think that
is part of what makes the Mahout shell different than the Spark shell plus
imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots
at this point are really just a POC, but at some point we may want to
integrate some data transformation features into the mahout plots
classes so they're really more future work.
OK. I'll read through the examples and try to do something with some
data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a
new thread?
Trevor Grant
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
have something that ggplot2 would not, the other way around is much more
expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major
limitations, it sounds like Zeppelin should be an all around very nice
venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native
mahout/smile plots and ggplot), since in the future we're going to be
adding more visualization features rather than simple scatter plots etc
that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data,
then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
issue about whether we want to go ahead and add another interpreter.
Trevor Grant
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a
library internal to mahout or should we leverage zeppelin's visualization
capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get
the new plotting features integrated with Zeppelin. We plan on adding more
advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a
couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots
are not working properly. The plots are swing based and can currently be
exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization
lib since it is pretty rich in ML support. We are looking at integrating
some of that with Zeppelin then adding code to feed the new visualizations
in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
the way to go. Smile is swing based but can output pngs, maybe other image
formats, Andy?
BTW Dmitriy is still very involved but has trouble getting permission to
donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is
super easy from a user perspective, almost too easy- eg why not let the
user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) as the implicit Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
  sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization
friendly'? I could pass the results to Angular or R just to show off how to
do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on how
to integrate with R...?
Trevor Grant
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell .. so let me know if you have any
questions there about what we did there..
Thanks alot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
,
Pat
,
Post by Trevor Grant
Andrew
Post by Pat Ferrel
Post by Pat Ferrel
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R (through
clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org<http://mahout.apache.org/> - that's where we all hangout and not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 21:31:43 UTC
Permalink
Ah, well I cloned the Till branch per your Nov 3 article..

git clone https://github.com/tillrohrmann/incubator-zeppelin.git
Post by Trevor Grant
That's a "new" feature in the 0.6-snapshot... Say within the last month or
two, how long has it been since you did a git pull?
I'll update soon with a note on that.
I can also create a gist with the code.
Post by Andrew Musselman
At this step of the tutorial I'm stuck because I don't have an "Import
"I’m going to do you another favor. Go to the Zeppelin home page and click on ‘Import Note’. When given the option between URL and json, click on URL
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"
Post by Trevor Grant
Looks like Flink shell is fixed :D
https://github.com/apache/flink/pull/1913
(I tested, is working good).
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
Post by Trevor Grant
Dmitriy really nailed it on the head in his reply to the post which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-plotting thing, we need to have a function that will spit out a tsv-like string of the data we want plotted.
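A minimal sketch of the "sample, then spit out a tsv string" idea, in plain Scala rather than the Mahout DSL. The names `TsvSketch`, `sampleRows` and `toTsv` are hypothetical and not part of Mahout; rows are plain arrays standing in for a matrix that has already been collected to the driver.

```scala
import scala.util.Random

// Hypothetical helper, not a Mahout API: plain arrays stand in for matrix rows.
object TsvSketch {
  /** Keep each row independently with probability `fraction` (Bernoulli sampling). */
  def sampleRows(rows: Seq[Array[Double]], fraction: Double, seed: Long): Seq[Array[Double]] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  /** Render a header plus rows as a tab-separated string, one row per line. */
  def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

The resulting string is what would be handed to the plotting side; a Zeppelin paragraph can probably also render it as a table by printing it with a `%table` prefix, though that is not something this thread has confirmed.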
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on
mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the Flink Shell is still busted on their end and we on the Mahout end have not had time to muck with it. What exists today is the Mahout-Spark shell.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore,
storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
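One way to read "condense big data information into small plottable information" is a 2-D density grid: count points per cell and plot the counts instead of the points. The sketch below is plain Scala under that assumption; `DensityGridSketch` and `densityGrid` are hypothetical names, not Mahout functions.

```scala
// Hypothetical helper, not a Mahout API.
object DensityGridSketch {
  /** Count points per cell of a bins x bins grid over [minX, maxX] x [minY, maxY]. */
  def densityGrid(points: Seq[(Double, Double)], bins: Int,
                  minX: Double, maxX: Double,
                  minY: Double, maxY: Double): Array[Array[Int]] = {
    require(bins > 0 && maxX > minX && maxY > minY, "need bins > 0 and a non-empty range")
    val grid = Array.ofDim[Int](bins, bins)
    // Map a coordinate to a bin index; clamping puts v == hi (and any
    // out-of-range value) into an edge bin instead of throwing.
    def cell(v: Double, lo: Double, hi: Double): Int = {
      val i = ((v - lo) / (hi - lo) * bins).toInt
      math.min(math.max(i, 0), bins - 1)
    }
    for ((x, y) <- points) grid(cell(x, minX, maxX))(cell(y, minY, maxY)) += 1
    grid
  }
}
```

Because the output is just counts, grids computed on separate partitions could be added element-wise, so the plotted object stays bins x bins no matter how many points went in.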
Agreed, something like sampling x% of points from a DRM (like the visuals I had from Palumbo for the talk in Vancouver that demonstrated the concept)
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
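The "could be done to a disk" variant of steps 3 and 4 can be sketched in plain Scala, assuming the tsv string already exists. `TsvDiskSketch`, `writeTsv` and `readTsv` are hypothetical names; inside Zeppelin the hand-off would presumably go through the resource pool instead (something like `z.put("name", tsv)` on the Scala side), which is not shown because it needs a running notebook.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Mahout or Zeppelin.
object TsvDiskSketch {
  /** Step 3 (disk variant): write the tsv string out as UTF-8; returns the path written. */
  def writeTsv(path: Path, tsv: String): Path =
    Files.write(path, tsv.getBytes(StandardCharsets.UTF_8))

  /** Step 4 stand-in: read the file back as rows of string cells. */
  def readTsv(path: Path): Vector[Vector[String]] = {
    val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    text.split("\n").toVector.map(_.split("\t").toVector)
  }
}
```

On the R or Python side the same file would be read with whatever tab-separated reader the plotting environment provides; `readTsv` is only here to show the round trip.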
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"org.apache.spark.serializer.KryoSerializer",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
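For reference, the four settings can be written as key/value pairs in plain Scala. The two values that appear without keys above are paired here with `spark.serializer` and `spark.kryo.registrator`, which are the property names listed elsewhere in this thread; treat that pairing, and the object name `MahoutKryoProps`, as assumptions rather than Pat's exact config.

```scala
// Hypothetical container for the Kryo-related Spark properties discussed above.
object MahoutKryoProps {
  val props: Map[String, String] = Map(
    "spark.serializer"             -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator"       -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer"  -> "300m"
  )
  // With Spark on the classpath these could be applied along the lines of
  //   new org.apache.spark.SparkConf().setAll(MahoutKryoProps.props)
  // In Zeppelin the same pairs would go into the Spark interpreter properties.
}
```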
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
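%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The left-hand side of this line was lost in the archive; `implicit val sdc` is an assumed reconstruction.
// It wraps Zeppelin's SparkContext `sc` as a Mahout distributed context.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)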
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Post by Pat Ferrel
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are swing based and can currently be exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is swing based but can output pngs, maybe other image formats, Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
Post by Trevor Grant
Post by Pat Ferrel
Post by Pat Ferrel
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- eg why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// The left-hand side of this line was lost in the archive; `implicit val sdc` is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question, is this even worth building a full interpreter for or just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our mahout spark shell .. so let me know if you have any questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYI...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org <http://mahout.apache.org/> - that's where we all hang out, not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Andrew Musselman
2016-05-20 21:35:07 UTC
Permalink
In any case, still getting this error in the console when I run this block:

"import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
sc2sdc(sc)"

"<console>:21: error: object mahout is not a member of package org.apache
import org.apache.mahout.math._"

On Fri, May 20, 2016 at 2:31 PM, Andrew Musselman <
Post by Andrew Musselman
Ah, well I cloned the Till branch per your Nov 3 article..
git clone https://github.com/tillrohrmann/incubator-zeppelin.git
Post by Trevor Grant
That's a "new" feature in the 0.6-snapshot... Say within the last month or
two, how long has it been since you did a git pull?
I'll update soon with a note on that.
I can also create a gist with the code.
Post by Andrew Musselman
At this step of the tutorial I'm stuck because I don't have an "Import
"I’m going to do you another favor. Go to the Zeppelin home page and
click on ‘Import Note’. When given the option between URL and json, click on URL
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"
On Fri, May 20, 2016 at 12:35 PM, Trevor Grant <
Looks like Flink shell is fixed :D
https://github.com/apache/flink/pull/1913
(I tested, is working good).
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit out a tsv like string of the data we wanted plotted.
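A minimal sketch of the tsv idea, assuming the (already sampled) matrix has been pulled in-core and is represented here as a plain Seq[Seq[Double]]. That type is a stand-in, not a Mahout type, and the object and method names are hypothetical rather than anything in Mahout:

```scala
object TsvSketch {
  // Render rows of doubles as a tab-separated string, one matrix row per line.
  // If a header is given, it is written as the first line.
  def toTsv(rows: Seq[Seq[Double]], header: Seq[String] = Nil): String = {
    val body = rows.map(_.mkString("\t"))
    val lines = if (header.isEmpty) body else header.mkString("\t") +: body
    lines.mkString("\n")
  }
}
```

In a Zeppelin paragraph, printing such a string prefixed with %table is one way to hand it to the built-in table and chart display.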
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on [...] to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the Flink Shell is still busted on their end and we on the Mahout end have not had time to muck with it. What exists today is the Mahout-Spark shell.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
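As a rough illustration of "condensing" data into something plottable, here is a sketch of equal-width histogram binning over values that are already in memory. The names are hypothetical and this is not Mahout code; on a distributed matrix the counts would presumably be computed per partition and then summed, which this sketch does not attempt:

```scala
object HistogramSketch {
  // Count values into `bins` equal-width buckets spanning [min, max] of the data.
  def histogram(values: Seq[Double], bins: Int): Array[Long] = {
    require(bins > 0, "bins must be positive")
    val counts = new Array[Long](bins)
    if (values.nonEmpty) {
      val lo = values.min
      val hi = values.max
      val width = (hi - lo) / bins
      values.foreach { v =>
        // The maximum lands in the last bucket; zero width means all values are equal.
        val idx = if (width == 0.0) 0 else math.min(((v - lo) / width).toInt, bins - 1)
        counts(idx) += 1
      }
    }
    counts
  }
}
```

However many points go in, only `bins` numbers come out, which is the property that makes the result small enough to plot.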
Agreed, something like sampling x% of points from a DRM (like the visuals I had from Palumbo for the talk in Vancouver that demonstrated the concept)
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
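The "sample of a matrix" sanity step mentioned in the next-steps list could be sketched roughly as below, assuming the rows are already in memory as a Scala Seq. The names are hypothetical and nothing here is Mahout API; it only shows the idea of keeping a fixed fraction of rows before exporting them for plotting:

```scala
import scala.util.Random

object SampleSketch {
  // Keep each row independently with probability `fraction` (Bernoulli sampling).
  // A fixed seed makes the sample repeatable between notebook runs.
  def sampleRows[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }
}
```

Because each row is kept independently, the sample size is only approximately fraction times the row count, which is usually fine for a scatter plot.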
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
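For reference, the Kryo-related Spark properties quoted in this thread can be collected in one place. The sketch below keeps them as plain key/value pairs and formats them as `--conf key=value` arguments; the object and method names are hypothetical, and in Zeppelin the same pairs would presumably be typed into the Spark interpreter's property table instead:

```scala
object KryoConfSketch {
  // Property names and values as quoted in this thread.
  val mahoutSparkProps: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  // Format each pair as two arguments: "--conf" followed by "key=value".
  def asConfArgs(props: Seq[(String, String)]): Seq[String] =
    props.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```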
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots are at this point really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds Great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage Zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
Most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are Swing based and can currently be exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is Swing based but can output PNGs, maybe other image formats - Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
My Java is sketchy at best, I tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- e.g. why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question, is this even worth building a full interpreter for or just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our mahout spark shell .. so let me know if you have any questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYI...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org<http://mahout.apache.org/> - that's where we all hangout and not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Trevor Grant
2016-05-20 21:54:36 UTC
Permalink
It appears the jars aren't loading.

Did you add those artifacts?

If your version is the one cloned from Till's, that's fairly ancient.

I need to update that post badly.

Do a fresh git clone from apache/incubator-zeppelin. The point of my last
post was to get Flink 0.10 working with Zeppelin pre-release. Zeppelin
snapshot is now on 1.0
Post by Andrew Musselman
"import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
sc2sdc(sc)"
"<console>:21: error: object mahout is not a member of package org.apache
import org.apache.mahout.math._"
On Fri, May 20, 2016 at 2:31 PM, Andrew Musselman <
Post by Andrew Musselman
Ah, well I cloned the Till branch per your Nov 3 article..
git clone https://github.com/tillrohrmann/incubator-zeppelin.git
Post by Trevor Grant
That's a "new" feature in the 0.6-snapshot... Say within the last month or two, how long has it been since you did a git pull?
I'll update soon with a note on that.
I can also create a gist with the code.
Post by Andrew Musselman
At this step of the tutorial I'm stuck because I don't have an "Import
"I’m going to do you another favor. Go to the Zeppelin home page and click on ‘Import Note’. When given the option between URL and json, click on URL
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"
On Fri, May 20, 2016 at 12:35 PM, Trevor Grant <
Post by Trevor Grant
Looks like Flink shell is fixed :D
https://github.com/apache/flink/pull/1913
(I tested, is working good).
Trevor Grant
On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
Post by Trevor Grant
Dmitriy really nailed it on the head in his reply to the post, which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
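The random-sampling step mentioned above can be sketched in plain Scala. This is only an illustration on an in-memory collection: `RowSampler` and `sampleRows` are made-up names, not Mahout API, and a real DRM would need the sampling done on the distributed side before anything is collected.

```scala
import scala.util.Random

object RowSampler {
  /** Keep each row independently with probability `fraction` (Bernoulli sampling). */
  def sampleRows[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    // A fixed seed makes the sample reproducible between runs.
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }
}
```

The sample size is only approximately `fraction * rows.size`, which is usually fine for plotting; a fixed-size sample would need a different technique such as reservoir sampling.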
For the Zeppelin-Plotting thing, we need to have a function that will spit out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on
to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the Flink Shell is still busted on their end and we on the Mahout end have not had time to muck with it. What exists today is the Mahout-Spark shell.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore,
storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
Agreed, something like sampling x% of points from a DRM (like the visuals I had from Palumbo for the talk in Vancouver that demonstrated the concept)
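As a rough illustration of the "density grid" idea Dmitriy describes, the sketch below condenses any number of 2-D points into a fixed-size grid of counts, so the plotted object stays small no matter how many rows went in. It is plain Scala on an in-memory sequence; `DensityGrid` and `bin2d` are hypothetical names, and a distributed version would compute per-partition grids and add them together.

```scala
object DensityGrid {
  /** Count points per cell of an nx-by-ny grid spanning the data's bounding box. */
  def bin2d(points: Seq[(Double, Double)], nx: Int, ny: Int): Array[Array[Int]] = {
    require(nx > 0 && ny > 0, "grid dimensions must be positive")
    val grid = Array.fill(nx, ny)(0)
    if (points.nonEmpty) {
      val xs = points.map(_._1)
      val ys = points.map(_._2)
      val (minX, maxX) = (xs.min, xs.max)
      val (minY, maxY) = (ys.min, ys.max)

      // Map a coordinate to a cell index; the maximum value lands in the last cell.
      def cell(v: Double, lo: Double, hi: Double, n: Int): Int =
        if (hi == lo) 0 else math.min(((v - lo) / (hi - lo) * n).toInt, n - 1)

      for ((x, y) <- points)
        grid(cell(x, minX, maxX, nx))(cell(y, minY, maxY, ny)) += 1
    }
    grid
  }
}
```

The output has `nx * ny` cells regardless of input size, which is the property that avoids overplotting.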
Trevor Grant
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order of
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, e.g. a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a data science tool for non-technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
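Step 3 of the pipeline above (table to tsv string) might look roughly like the following. This is a minimal sketch on plain Scala collections rather than a Mahout matrix: `TsvExport`, `toTsv`, and the `maxRows` cap are assumptions made for illustration, with the cap standing in for the "some sanity" guard mentioned earlier.

```scala
object TsvExport {
  /** Render a header plus at most `maxRows` rows as tab-separated text. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]], maxRows: Int): String = {
    require(maxRows >= 0, "maxRows must be non-negative")
    // First line is the header; each following line is one row of values.
    val lines = header.mkString("\t") +: rows.take(maxRows).map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin paragraph the resulting string could then be handed to the resource pool (for example with `z.put`) and parsed on the R side; that part depends on the Zeppelin runtime and is not shown here.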
Trevor Grant
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Post by Pat Ferrel
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and set up is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a mahout specific data type like vector or drm, something registered with Kryo.
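For reference, the Kryo settings discussed in this thread can be collected in one place as a plain Scala map. This is a sketch, not Pat's actual code: `MahoutKryoProps` and `render` are hypothetical names, and the values are simply the ones quoted in the thread (the 300m buffer is his setting, not a general recommendation).

```scala
object MahoutKryoProps {
  // Spark property names mapped to the values quoted in this thread.
  val props: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer" -> "300m"
  )

  /** Render as sorted "key value" lines, e.g. for copying into interpreter settings. */
  def render(p: Map[String, String]): String =
    p.toSeq.sortBy(_._1).map { case (k, v) => s"$k $v" }.mkString("\n")
}
```

With Spark on the classpath, the same map could be applied through `new SparkConf().setAll(MahoutKryoProps.props)` before the context is created; in Zeppelin the equivalent is adding the properties to the Spark interpreter settings.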
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make Zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer
org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
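For anyone trying the same thing outside Zeppelin, here is a minimal sketch of how the two Kryo properties listed above might be set in plain Spark before wrapping the context for Mahout. It assumes a standalone job with Spark and the Mahout Spark bindings on the classpath; the app name and master are made up for illustration. Inside Zeppelin `sc` already exists, so there the properties belong in the interpreter settings instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings._

// Assumption: standalone Spark job, not a Zeppelin paragraph.
val conf = new SparkConf()
  .setAppName("mahout-kryo-sketch") // hypothetical app name
  .setMaster("local[*]")            // hypothetical master
  // The two properties quoted in this thread:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")

val sc = new SparkContext(conf)

// Same wrapping step as in the notebook snippet.
implicit val sdc: SparkDistributedContext = sc2sdc(sc)
```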
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me:)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage Zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are swing based and can currently be exported as PNGs. There are a few examples on the closed
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is swing based but can output pngs, maybe other image formats, Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration is super easy from a user perspective, almost too easy- eg why not let the user add it themselves... Add the appropriate maven artifacts, restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question, is this even worth building a full interpreter for or just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our mahout spark shell .. so let me know if you have any questions about what we did there..
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things."
-Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math but we are now looking at streaming online algo support too.
In any case you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org - that's where we all hang out, not having to worry about avoiding naughty words.
Looking forward to working with you
Suneel
Suneel Marthi
2016-05-20 21:55:34 UTC
Permalink
I concur. It seems to work for me with the present Zeppelin snapshot version.
Post by Trevor Grant
It appears the jars aren't loading.
Did you add those artifacts?
If your version is the one cloned from Till's, that's fairly ancient.
I need to update that post badly.
Do a fresh git clone from apache/incubator-zeppelin. The point of my last post was to get Flink 0.10 working w/ Zeppelin pre-release. Zeppelin snapshot is now on 1.0
Post by Andrew Musselman
In any case, still getting this error in the console when I run this
"import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)"
"<console>:21: error: object mahout is not a member of package org.apache
import org.apache.mahout.math._"
On Fri, May 20, 2016 at 2:31 PM, Andrew Musselman <
Post by Andrew Musselman
Ah, well I cloned the Till branch per your Nov 3 article..
git clone https://github.com/tillrohrmann/incubator-zeppelin.git
On Fri, May 20, 2016 at 2:28 PM, Trevor Grant <
Post by Trevor Grant
That's a "new" feature in the 0.6-snapshot... Say within the last month or two, how long has it been since you did a git pull?
I'll update soon with a note on that.
I can also create a gist with the code.
On May 20, 2016 4:24 PM, "Andrew Musselman" <
Post by Andrew Musselman
At this step of the tutorial I'm stuck because I don't have an "Import
"I’m going to do you another favor. Go to the Zeppelin home page and click on ‘Import Note’. When given the option between URL and json, click on URL
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"
On Fri, May 20, 2016 at 12:35 PM, Trevor Grant <
Post by Trevor Grant
Looks like Flink shell is fixed :D
https://github.com/apache/flink/pull/1913
(I tested, is working good).
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
Post by Andrew Musselman
Post by Trevor Grant
On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi <
On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
Post by Trevor Grant
Dmitriy really nailed it on the head in his reply to the post, which I'll rebroadcast below. In essence the whole reason you are (theoretically) using Mahout is the data is too big to fit in memory. If it's too big to fit in memory, well then it's probably too big to plot each point (e.g. trillions of rows, you only have so many pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will 'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will spit out a tsv-like string of the data we want plotted.
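To make the tsv idea concrete, here is a rough sketch in plain Scala, assuming the data has already been reduced to a small in-memory collection. `toZeppelinTable` is a hypothetical helper, not something in Mahout or Zeppelin; the only Zeppelin-specific part is the `%table` prefix, which tells Zeppelin's display system to treat tab-separated output as a table.

```scala
// Hypothetical helper (not part of Mahout or Zeppelin): renders a header and
// numeric rows as a tab-separated string prefixed with Zeppelin's %table directive.
def toZeppelinTable(header: Seq[String], rows: Seq[Seq[Double]]): String = {
  val head = header.mkString("\t")          // column names, tab separated
  val body = rows.map(_.mkString("\t"))     // one tab-separated line per row
  (("%table " + head) +: body).mkString("\n")
}

val tsv = toZeppelinTable(Seq("x", "y"), Seq(Seq(1.0, 2.5), Seq(2.0, 4.5)))
println(tsv)
```

Printing the string from a notebook paragraph should be enough for the built-in charts to pick it up, but that part is untested here.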
I agree an honest Mahout interpreter in Zeppelin is probably worth doing. There are a couple of ways to go about it. I opened up the discussion on that to mean we can do it in a way that makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an honest-mahout interpreter but I want to play in the code and look at the Flink-Mahout shell a bit before I try to organize my thoughts and present them.
FYI Trevor, there's no Flink-Mahout shell today; in large part because the Flink Shell is still busted on their end and we on the Mahout end have not had time to muck with it. What exists today is the Mahout-Spark shell.
...(2) not sure what is the point of supporting distributed anything. It is distributed presumably because it is hard to keep it in memory. Therefore, storage space and overplotting due to number of points. The idea is that we have to work out algorithms that condense big data information into small plottable information (like density grids, for example, or histograms)....
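As an illustration of what "condensing" could look like, here is a small equal-width histogram in plain Scala. It is only a sketch on an in-memory collection; `histogram` is a made-up helper, and a distributed version would presumably compute the same counts per partition and sum them.

```scala
// Hypothetical helper: counts values into `bins` equal-width buckets over [lo, hi].
// Values outside the range are ignored.
def histogram(values: Seq[Double], lo: Double, hi: Double, bins: Int): Array[Int] = {
  require(bins > 0 && hi > lo, "need bins > 0 and hi > lo")
  val counts = new Array[Int](bins)
  val width = (hi - lo) / bins
  for (v <- values if v >= lo && v <= hi) {
    // The top edge (v == hi) is folded into the last bucket.
    val idx = math.min(((v - lo) / width).toInt, bins - 1)
    counts(idx) += 1
  }
  counts
}

val counts = histogram(Seq(0.1, 0.2, 0.55, 0.9, 1.0), lo = 0.0, hi = 1.0, bins = 4)
println(counts.mkString(","))
```

However many points go in, the output is only `bins` numbers, which is the property that matters for plotting.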
Agreed, something like sampling x% of points from a DRM (like the visuals I had from Palumbo for the talk in Vancouver that demonstrated the concept)
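A sketch of the "sample x% of points" idea, again on a plain Scala collection rather than a DRM. `sampleRows` and its `seed` parameter are hypothetical names for illustration; each row is kept independently with probability `fraction`, so the sample size is only approximately x% of the input.

```scala
import scala.util.Random

// Hypothetical helper: keeps each row independently with probability `fraction`.
def sampleRows[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
  require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
  val rng = new Random(seed) // seeded so the sample is reproducible
  rows.filter(_ => rng.nextDouble() < fraction)
}

val rows = (1 to 1000).map(i => (i.toDouble, 2.0 * i))
val sample = sampleRows(rows, fraction = 0.05, seed = 42L)
println(s"kept ${sample.size} of ${rows.size} rows")
```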
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things."
-Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of
things."
-Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of
things."
-Virgil*
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor this is very cool- I have not been able to look at it closely yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and t-digest.
Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below, however without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2
https://github.com/rawkintrevo/mahout-zeppelin
Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr support running already, you may import the following raw notes directly
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order:
- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like - e.g. build-profile vs. tutorial
- I think the case for making a build-profile is that Zeppelin is first and foremost a datascience tool for non technical users.
- If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' mahout we can include, e.g. does the user have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout interpreter
- This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have zeppelin) in R (python soon) and create a <plot package of your choice>
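Step 3 above (export a table to a tsv string) can be sketched in plain Scala. The sketch assumes the matrix has already been collected into an in-memory `Seq[Array[Double]]` and that column names are supplied by the caller; `toTsv` is a hypothetical helper name, not an existing Mahout function.

```scala
// Hypothetical helper: render a header line plus one tab-separated line per row.
def toTsv(header: Seq[String], rows: Seq[Array[Double]]): String = {
  require(rows.forall(_.length == header.length), "every row must match the header width")
  val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
  lines.mkString("\n")
}

// Example: a 2 x 2 table with columns x and y.
val tsv = toTsv(Seq("x", "y"), Seq(Array(1.0, 2.0), Array(3.5, 4.0)))
```

In a Zeppelin Spark paragraph the resulting string would presumably be handed to the resource pool (for instance through the `z.put` call on the ZeppelinContext) or written to disk, and then read back in the R or Python paragraph.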
To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of
things."
-Virgil*
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and setup is the next question.
BTW this is what I use elsewhere (Mahout as a lib to this code)
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",
afaik you will only see if Kryo is working when you have to serialize a Mahout-specific data type like a vector or DRM, something registered with Kryo.
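For reference, the Kryo-related settings mentioned in this thread can be collected in one place. The sketch below is plain Scala with no Spark dependency: the property names and values are the ones quoted in these messages, while `mahoutKryoProps` and `toSparkDefaults` are hypothetical names used only for illustration.

```scala
// Kryo settings quoted in the thread, gathered as key -> value pairs.
val mahoutKryoProps: Map[String, String] = Map(
  "spark.serializer"             -> "org.apache.spark.serializer.KryoSerializer",
  "spark.kryo.registrator"       -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
  "spark.kryo.referenceTracking" -> "false",
  "spark.kryoserializer.buffer"  -> "300m"
)

// Hypothetical helper: render the pairs as "key value" lines,
// the layout used by a spark-defaults.conf file, sorted by key.
def toSparkDefaults(props: Map[String, String]): String =
  props.toSeq.sortBy(_._1).map { case (k, v) => s"$k $v" }.mkString("\n")

// In a Spark application the same pairs could be applied with SparkConf.setAll(mahoutKryoProps);
// in Zeppelin they would be entered as properties of the Spark interpreter.
println(toSparkDefaults(mahoutKryoProps))
```

Keeping the pairs in one map makes it easier to compare what the Zeppelin interpreter is given against what a standalone Mahout-on-Spark job uses.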
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap- we're trying to leverage Zeppelin for charting.
It seems as though this can be achieved by
- Adding properties to the Spark Interpreter
- Adding dependency jars to the spark interpreter
- importing in a spark paragraph
All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Below I will outline the imports/properties, please look over and tell me if I'm theoretically missing anything.
The next phase for me will be
1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe then map using ggplot
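The receiving end of that hand-off (unpacking the passed object into something dataframe-shaped) might look roughly like the following. The thread does this step in R; the sketch stays in Scala for consistency with the other code here, assumes the object is a tab-separated string with a header line and numeric cells, and uses the hypothetical name `fromTsv`.

```scala
// Hypothetical helper: parse a TSV string (header line + numeric rows) into named columns.
def fromTsv(tsv: String): Map[String, Vector[Double]] = {
  val lines = tsv.split("\n").toVector.filter(_.nonEmpty)
  require(lines.nonEmpty, "expected at least a header line")
  val header = lines.head.split("\t").toVector
  val rows = lines.tail.map(_.split("\t").map(_.toDouble))
  require(rows.forall(_.length == header.length), "every row must match the header width")
  // Column-oriented layout, similar in spirit to a dataframe.
  header.zipWithIndex.map { case (name, i) => name -> rows.map(_(i)) }.toMap
}

val columns = fromTsv("x\ty\n1.0\t2.0\n3.5\t4.0")
```

A column-per-name layout is what plotting packages generally expect, which is why the string is unpacked by column rather than kept as rows.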
Once I have a working prototype I will work to add some syntactic sugar to prepare the matrix from the scala side and pass to zeppelin (using resource pools so the same functionality can be reused in Flink) and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.
Once it's in a Dataframe in R- go nuts with any plotting package you like.
Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post but it's almost a textbook entry tutorial on using imports in Zeppelin. (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's angularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make zeppelin recognize matrices as tables and expose all of the built in charting features of Zeppelin.
If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...) If anything seems redundant or missing, please call it out.
spark.kryo.registrator org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of
things."
-Virgil*
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the code in the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells but I think that is part of what makes the Mahout shell different than the Spark shell plus imports, it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the mahout plots classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should i just reply all and cc dev or start a new thread?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of
things."
-Virgil
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
fwiw ggplot2 is pretty darn advanced :) i am a bit skeptical smile would have something that ggplot2 would not, the other way around is much more expected by me :)
anyhow if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
sorry- answering a question from a couple emails back on the thread.
If possible, I think it would be great to eventually have both (native mahout/smile plots and ggplot), since in the future we're going to be adding more visualization features rather than simple scatter plots etc that may not be covered by ggplot.
That's why we were thinking about using angular and the pngs.
But what you're saying in your last email would be great!
Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of
things."
-Virgil
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
sorry for double email but are you thinking visualization should be a library internal to mahout or should we leverage zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of
things."
-Virgil
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <
Sorry- to be a little more clear, part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
most of the hard work was done by Dmitriy, I've just reworked it a couple of times to keep up with spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
the plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are swing based and can currently be exported as PNGs. There are a few examples on the closed PR:
https://github.com/apache/mahout/pull/230
There is an example script in
examples/bin/spark-shell-plot.mscala
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s the way to go. Smile is swing based but can output pngs, maybe other image formats, Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant <
Hey Andrew,
thanks- you basically did all of the hard work for me!
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
my java is sketchy at best, i tend to over import. I pulled in the
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so bc this isn't public) this integration
is super easy from a user perspective, almost too easy- eg why not let
the user add it themselves... Add the appropriate maven artifacts,
restart the
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
// wrap the notebook's existing SparkContext (sc) as a Mahout distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
that said, adding a build profile like -PsparkMahout and creating an
interpreter like %spark.mahout should be fairly straightforward.
Second question, do you have an example that would be more
'visualization friendly'? I could pass the results to Angular or R just
to show off how to do it.
Which leads back to the question, is this even worth building a full
interpreter for or just make a really nice blog post with examples on
how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good
amount of work on our mahout spark shell, so let me know if you have any
questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost
code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to
download now and start playing around.
I had talked about this briefly, but given a properly functioning
Zeppelin interpreter for Apache Mahout, one could leverage all of the
Zeppelin visualizations, anything in AngularJS, or anything in R
(through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <
FYi...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <
Hey Trevor,
Good to meet you. As you probably know Mahout-Samsara is a reincarnation
of the project in a new body, which is less a collection of algorithms
than a roll-your-own math/algorithm tool. The major benefit is that
during experimentation and later in production the code is by nature
scalable on Spark and Flink. Most of the Mahout DSL is R-like and
supports tensor math but we are now looking at streaming online algo
support too.
In any case you probably know we have a Mahout version of the Spark
Shell, which has been integrated with an old version of Zeppelin (code
is lost). Recently Andy has experimented with some very nice
visualizations of ML data (not just analytics data). We as a project are
interested in Zeppelin integration of our shell and graphics. From what
I understand the graphics extension mechanism of Zeppelin is based on
AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would
love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi <
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to
introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout
PMC).
As I mentioned in my talk, we are actively looking at Zeppelin
integration with Mahout (primarily for spark) and would appreciate your
help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project
and shedding its legacy MapReduce image.
I sent u an invite to the Mahout slack channel, mahout.apache.org -
that's where we all hangout and not having to worry about avoiding
naughty words.
Looking forward to working with you
Suneel
Dmitriy Lyubimov
2016-05-19 20:54:40 UTC
Permalink
Trevor, left a comment on your blog before realizing i should've really
been commenting here...

-d
In mahout 0.13 we'll be looking at row reduction methods other than just
sampling to transform DRM -> matrix so that it fits in memory. This is
great!
-------- Original message --------
Date: 05/19/2016 12:02 AM (GMT-05:00)
Subject: RE: Future Mahout - Zeppelin work
Well done, Trevor! I've not yet had a chance to try this in zeppelin but
I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
Dmitriy Lyubimov
2016-05-19 20:55:10 UTC
Permalink
Trevor, terrific job on zeppelin post btw. thanks!
Dmitriy Lyubimov
2016-05-19 20:57:40 UTC
Permalink
still in reply to the blog: i wish though zeppelin had a true mahout
interpreter. all it basically requires is to reuse spark settings but
execute proper imports and context init, and provide this "tablify" routine
somehow.
Trevor Grant
2016-05-31 13:54:56 UTC
Permalink
I opened a PR against Zeppelin anyway:
https://github.com/apache/incubator-zeppelin/pull/928

Mainly to force the conversation since no one on ***@zeppelin had anything
to say.

The main push back was not on messing with the Spark interpreter, but "why
a patch instead of just documentation, (i.e the blog post)."

I'm responding to Moon on the PR, but would encourage others to join the
conversation with reasons a 'baked in' Mahout interpreter is preferable to
the existing blog post.

tg



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Pat Ferrel
-------- Original message --------
Date: 05/29/2016 1:16 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.
1- The blog post referenced the linear-regression example notebook twice-
I've updated it to reference the ggplot integration. E.g. import this
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update with a blurb about sampling, however it is done
in that note...) So to any who tried the blog, a huge apology because
that notebook is where all of the 'magic happened' (all of the screen
shots / gg-plots / etc happened there).
https://github.com/rawkintrevo/incubator-zeppelin
if you build, and set 'spark.mahout' to 'true' in the Spark Interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it, I'll be opening a PR soon, we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make
sure I don't need to do a major refactor before I get carried away with
those activities.
In essence when 'spark.mahout' is 'true' you jump right in on r-like dsl
and you have a sdc declared based on the underlying sc.
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity. I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under Zeppelin project
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.
What do others have to say?
I agree that a Mahout-Spark-Zeppelin interpreter would be ideal.
Have a good weekend all.
have a good holiday weekend,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Thx Trevor,
Re: m-1854, it was something that we started when we were first
discussing using the smile plots and trying to pipe them over to
Zeppelin. As far as I know no progress was made on it. I've unassigned
it.
Feel free to assign any Jiras to yourself. I think that m-1854 is
similar to the mahout-spark-shell, so I may be able to help out there.
________________________________________
Sent: Saturday, May 28, 2016 11:21:44 PM
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date?
How can I help?
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Great!
When you free up and have the time, could you create some Jiras for
these? We actually have MAHOUT-1852 open for Histograms already, and
MAHOUT-1854 and MAHOUT-1855 (early Zeppelin integration Jiras). I can
close m-1854 and m-1855 out and we can start new ones if they're not
relevant anymore or we can just go with those.
Thanks
________________________________________
Sent: Thursday, May 26, 2016 3:17:22 PM
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout
interpreter into Zeppelin, and given that plans are on hold for a
Flink-Mahout in the short term, I think it should be a piggy-back spark
interpreter (e.g. exposed through something like %spark.mahout). So I
have thoughts, but no plan. Been busy with a couple of other commitments.
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things like
histograms, etc, that one would want to plot. I.e. histogram bucketing?
(less important for the moment) an interpreter.
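The first item, turning a small in-memory matrix into a TSV string, could look roughly like the sketch below. It uses plain Scala collections instead of Mahout's matrix types so it runs anywhere; `toTsv` and `toZeppelinTable` are hypothetical names, not existing Mahout functions. The `%table` prefix is the Zeppelin display-system convention for rendering tab-separated output as a table.

```scala
object TsvSketch {
  /** Render a header and numeric rows as tab-separated text, one row per line. */
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    require(rows.forall(_.length == header.length),
      "every row must have as many columns as the header")
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }

  /** Prefix the TSV with Zeppelin's %table directive so a paragraph renders it as a table. */
  def toZeppelinTable(header: Seq[String], rows: Seq[Seq[Double]]): String =
    "%table " + toTsv(header, rows)
}
```

In a notebook paragraph one would `println(TsvSketch.toZeppelinTable(...))`; the same string could also be handed to another interpreter through the resource pool.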
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Suneel Marthi
While on this subject, do we have a plan yet of integrating Zeppelin
into Mahout (or the converse) of having Mahout specific interpreter for
Zeppelin? I think that should be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
Trevor, I was reading over your blog last night again- first time since
you updated. It is great!
I have one suggestion: adding in a code line showing how the sampling is
done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
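For readers who have not seen that step: the idea is to pull a fixed number of rows out of the big matrix, without replacement, so the result fits in memory and can be plotted. The sketch below shows the same idea on a plain in-memory collection. It is a stand-in for illustration only, not how Mahout implements `drmSampleKRows`, and `sampleKRows` with its `seed` parameter is a hypothetical helper.

```scala
import scala.util.Random

object SampleSketch {
  /** Pick up to k distinct rows uniformly at random (sampling without replacement). */
  def sampleKRows[A](rows: IndexedSeq[A], k: Int, seed: Long = 42L): IndexedSeq[A] = {
    require(k >= 0, "k must be non-negative")
    val rng = new Random(seed)
    // Shuffle the row indices, keep the first k, then look the rows up.
    rng.shuffle(rows.indices.toVector).take(k).map(rows)
  }
}
```

Shuffling indices only works when the row count itself is manageable; on a distributed matrix the sampling has to happen on the cluster side, which is what the Mahout call is for.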
Andy
________________________________________
Sent: Friday, May 20, 2016 7:56:20 PM
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version
is uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to
the very latest
Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <
Post by Andrew Musselman
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
Post by Dmitriy Lyubimov
no parenthesis.
import o.a.m.sparkbindings._
....
myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
Hey Pat,
If you spit out a TSV - you can import into pyspark / matplotlib from
the resource pool in essentially the same way and use that plotting
library if you prefer. In fact you could import the tsv into pandas and
use all of the pandas plotting as well (though I think it is for the
most part, also matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql,
and scala-spark all share the same spark context you can create RDDs in
one language and access them / work on them in another (so I
understand).
So in Mahout can you "save" a matrix as a RDD? e.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
Agreed.
BTW I don’t want to stall progress but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark though the
visualizations are extremely limited in our case. If we can pass an RDD
to pyspark it would allow custom reductions in python before plotting,
even though we will support many natively in Mahout. I’m guessing that
this would cross a context boundary and require a write to disk?
1) what does the inter language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <
Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is too big to fit in memory. If it's too big to
fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels). For the example I
randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-Plotting thing, we need to have a function that will
spit out a tsv like string of the data we wanted plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth
doing. There are a couple of ways to go about it. I opened up the
discussion on [...] to take that to mean we can do it in a way that
makes the most sense to Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an
honest-mahout interpreter but I want to play in the code and look at the
Flink-Mahout shell a bit before I try to organize my thoughts and
present them.
...(2) not sure what is the point of supporting distributed anything. It
is distributed presumably because it is hard to keep it in memory.
Therefore, plotting anything distributed potentially presents 2
problems: storage space and overplotting due to number of points. The
idea is that we have to work out algorithms that condense big data
information into small plottable information (like density grids, for
example, or histograms)....
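To make the "condense into something plottable" point concrete, here is a minimal sketch of equal-width histogram bucketing over an in-memory sequence. It is only an illustration of the idea under simple assumptions (`histogram` is a hypothetical helper, not a Mahout function); a distributed version would presumably compute per-partition counts the same way and add them up.

```scala
object HistogramSketch {
  /** Count values into `bins` equal-width buckets spanning [min, max].
    * Returns each bucket's lower edge paired with its count. */
  def histogram(values: Seq[Double], bins: Int): Vector[(Double, Int)] = {
    require(bins > 0, "need at least one bin")
    require(values.nonEmpty, "need at least one value")
    val lo = values.min
    val hi = values.max
    // Fall back to width 1.0 when every value is identical.
    val width = if (hi > lo) (hi - lo) / bins else 1.0
    val counts = Array.fill(bins)(0)
    for (v <- values) {
      // Clamp so the maximum value lands in the last bucket.
      val idx = math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    Vector.tabulate(bins)(i => (lo + i * width, counts(i)))
  }
}
```

However many rows go in, only `bins` pairs come out, which is small enough to ship to a plotting library as a table.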
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
Great job Trevor, we’ll need this detail to smooth out the sharp edges
and any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <
Agreed, thoroughly enjoying the blog post.
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog which is great!
-------- Original message --------
Date: 05/18/2016 2:44 PM (GMT-05:00)
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
Sent: Wednesday, May 18, 2016 11:02:43 AM
Subject: Re: Future Mahout - Zeppelin work
Ah yes, I remember you pointing that out to me too.

I got sidetracked yesterday for most of the day on an adventure in getting Zeppelin to work right after I accidentally updated to the new snapshot (free hint: the secret was to clear my cache *face-palm*).

I'm going to add that dependency to the readme.md now.

thanks,
tg
Trevor Grant
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
Trevor, this is very cool. I have not been able to look at it closely yet, but just a small point: I believe that you'll also need to add the mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar for things like the classification stats, confusion matrix, and t-digest.

Andy
________________________________________
Sent: Wednesday, May 18, 2016 10:47:21 AM
Subject: Re: Future Mahout - Zeppelin work
I still need to update my readme/env per Pat's comments below; however, without further ado, I present two notebooks that integrate Mahout + Spark + Zeppelin + ggplot2:

https://github.com/rawkintrevo/mahout-zeppelin

Supposing you have a somewhat recent version of Zeppelin (0.6) with SparkR support running already, you may import the following raw notes directly:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
So my thoughts on next steps, which I'm positing only as a starting point for discussion, and are in no particular order:

- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv string (with some sanity, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like, e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first and foremost a data science tool for non-technical users.
  - If we go that route I'll need some more support finding out what is the absolute minimum 'bare-bones' Mahout we can include, e.g. does the user have to have Mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same thing in Python.
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout interpreter
   - This is taken care of by setting some env. variables, adding some dependencies, and importing relevant packages
2) do Mahout things as you do
3) export table to tsv string, which is passed to a resource pool
   - This could be done to a disk if you didn't have Zeppelin
4) read the tsv from the resource pool (or disk if you didn't have Zeppelin) in R (python soon) and create a <plot package of your choice>
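
Step 3 above (matrix to tsv string, "with some sanity") could look roughly like the sketch below. It is only an illustration in plain Scala: `TsvExport`, `toTsv`, and `maxRows` are made-up names, an `Array[Array[Double]]` stands in for a collected Mahout matrix, and handing the string to Zeppelin's resource pool is left out.

```scala
// Hypothetical helper, not part of Mahout or Zeppelin.
object TsvExport {
  /** Render at most `maxRows` leading rows of a row-major matrix as a TSV
    * string: one header line, then one tab-separated line per row.
    * Taking the head is the simplest possible "sample"; a random sample
    * would be a reasonable alternative.
    */
  def toTsv(rows: Array[Array[Double]],
            colNames: Seq[String],
            maxRows: Int = 1000): String = {
    require(rows.forall(_.length == colNames.length),
      "every row must have one value per column name")
    val header = colNames.mkString("\t")
    val body = rows.take(maxRows).map(_.mkString("\t"))
    (header +: body).mkString("\n")
  }
}
```

Capping the number of rows keeps the string small enough to pass between paragraphs; anything that needs the full matrix should probably go through disk instead.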
To Pat's point, this is a kind of clumsy pipeline; however, the Zeppelin wrapper at least makes it *feel* less so.
Trevor Grant
On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <
Seems like there is plenty to use in ggplot or python, but the pipeline is a little convoluted (so maybe no need for Angular integration). To get graphics out of Mahout it would be nice to not require knowledge of R and/or python. Knowing Mahout is already bad enough, but I guess the API from the Mahout side for plotting could be Scala syntactic sugar. What and how this all is installed and set up is the next question.

BTW this is what I use elsewhere (Mahout as a lib to this code):

"org.apache.spark.serializer.KryoSerializer",
"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m",

AFAIK you will only see if Kryo is working when you have to serialize a Mahout-specific data type like a vector or DRM, something registered with Kryo.
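
For anyone wanting these settings in one place, here is a rough sketch that pairs the values above with the property names used elsewhere in this thread (`spark.serializer`, `spark.kryo.registrator`). The object and method names are invented for illustration, and the code deliberately avoids a Spark dependency: it only renders `--conf key=value` arguments of the kind `spark-submit` accepts.

```scala
// Hypothetical holder for the Kryo-related Spark properties quoted in this thread.
object MahoutSparkProps {
  val kryoProps: Seq[(String, String)] = Seq(
    "spark.serializer"             -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator"       -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer"  -> "300m"
  )

  /** Turn property pairs into a flat list of "--conf", "key=value" arguments. */
  def asConfArgs(props: Seq[(String, String)]): Seq[String] =
    props.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

In Zeppelin the same pairs would be entered as interpreter properties rather than command-line arguments; the 300m buffer is simply the value quoted above, not a recommendation.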
On May 16, 2016, at 6:18 PM, Trevor Grant <
As a quick recap, we're trying to leverage Zeppelin for charting. It seems as though this can be achieved by

- Adding properties to the Spark interpreter
- Adding dependency jars to the Spark interpreter
- importing in a spark paragraph

All seems to be working well, but I've fooled myself into thinking things were 'working' before because I wasn't actually integrating. Lower I will outline the imports/properties; please look over and tell me if I'm theoretically missing anything.

The next phase for me will be

1) Convert a matrix to some sort of serializable object that I can easily unpack from R
2) use Zeppelin's resource buffers to pass the object
3) collect the object in an R paragraph, convert it to a dataframe, then map using ggplot
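
The receiving end of steps 1 to 3 amounts to parsing a delimited string back into named columns. The sketch below shows that shape in Scala (for consistency with the Scala paragraph in this thread, not because the consumer would be Scala; in practice this part would live in R or Python). `TsvImport` and `parse` are illustrative names, and the input format (header line, tab-separated numeric rows) is an assumption.

```scala
// Hypothetical counterpart to a TSV export: header line + numeric rows -> named columns.
object TsvImport {
  /** Parse a TSV string into a column-name -> values map (a poor man's dataframe). */
  def parse(tsv: String): Map[String, Vector[Double]] = {
    val lines = tsv.split("\n").toVector.filter(_.nonEmpty)
    require(lines.nonEmpty, "expected at least a header line")
    val names = lines.head.split("\t").toVector
    // Each remaining line becomes one row of doubles.
    val rows = lines.tail.map(_.split("\t").map(_.toDouble))
    names.zipWithIndex.map { case (name, j) => name -> rows.map(_(j)) }.toMap
  }
}
```

For example, `TsvImport.parse("x\ty\n1.0\t2.0\n3.0\t4.5")` gives column `x` the values 1.0 and 3.0 and column `y` the values 2.0 and 4.5.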
Once I have a working prototype I will add some syntactic sugar to prepare the matrix from the Scala side and pass it to Zeppelin (using resource pools so the same functionality can be reused in Flink), and an R library containing some functions which will pull the data out of the resource pool and spit out a dataframe.

Once it's in a dataframe in R, go nuts with any plotting package you like.

Likewise, it should be possible to do the same thing with matplotlib and python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
All of this doesn't necessarily require any changing of the Zeppelin source code, and isn't very intrusive or difficult to set up. I'll make a blog post, but it's almost a textbook entry tutorial on using imports in Zeppelin (e.g. a tutorial would be just as at home on the Zeppelin site as it would on the Mahout site).
Now, there has been some talk of using Zeppelin's AngularJS. Things get a little more hairy in that case, but we could make an optional build profile that would make Zeppelin recognize matrices as tables and expose all of the built-in charting features of Zeppelin.

If you're not adding a bunch of custom charts to Zeppelin (which would be somewhat tedious), you're going to end up with a lot of examples where you create a table in Mahout/Spark, pass it to AngularJS, then some AngularJS code charts it for you. At that point, however, you're doing just as much work, if not more, than it would be to simply pass to R or Python and let ggplot or matplotlib do the work for you.
Finally, I haven't run into any errors yet using Kryo (which in part is what makes me fear I'm not doing this right... it was too easy...). If anything seems redundant or missing, please call it out.

spark.kryo.registrator    org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer          org.apache.spark.serializer.KryoSerializer
Add artifacts (need to change these to maven not local, also need to add/change one jar per below, however this does
/home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
/home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
Add following code to first paragraph of
```
%spark
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap the Zeppelin-provided SparkContext `sc` as Mahout's distributed context
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Trevor Grant
On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <
Creating an mc used to do some Kryo setup, like registering serializers or serializer factories IIRC. Also there is the Spark conf for allocating memory for the Kryo buffer. Look at the mc creation code in the Spark package helpers. All can be done in straight Spark and passed in to create the mc when needed. Again from old weak brain cells, but I think that is part of what makes the Mahout shell different than the Spark shell plus imports: it auto-creates the mc instead of or along with an sc.
When I get back to my computer I can check.
On May 16, 2016, at 3:40 PM, Andrew Palumbo <
Trevor,
Could you post any Kryo errors that you may be having?
________________________________
Sent: Monday, May 16, 2016 6:25:07 PM
To: mahout
Subject: Future Mahout - Zeppelin work
To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots at this point are really just a POC, but at some point we may want to integrate some data transformation features into the Mahout plots classes, so they're really more future work.
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).
I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.

Sounds great.
Thank you.
________________________________
Sent: Monday, May 16, 2016 5:49:17 PM
To: Dmitriy Lyubimov
Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
Subject: Re: Intro - Future Mahout - Zeppelin work
I just signed up for dev, should I just reply all and cc dev or start a new thread?
Trevor Grant
On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <
FWIW ggplot2 is pretty darn advanced :) I am a bit skeptical Smile would have something that ggplot2 would not; the other way around is much more expected by me :)

Anyhow, if ggplot2 and matplotlib are available in Zeppelin without major limitations, it sounds like Zeppelin should be an all-around very nice venue then.
On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <
Sorry, answering a question from a couple emails back on the thread.

If possible, I think it would be great to eventually have both (native Mahout/Smile plots and ggplot), since in the future we're going to be adding more visualization features, rather than simple scatter plots etc., that may not be covered by ggplot.

That's why we were thinking about using Angular and the pngs.

But what you're saying in your last email would be great!

Thank you!
________________________________
Sent: Monday, May 16, 2016 5:33:12 PM
To: Andrew Palumbo
Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
I somehow replied to your last email without seeing it...
OK. I'll read through the examples and try to do something with some data, then do a ggplot and/or an angular plot on it (probably ggplot).

I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin issue about whether we want to go ahead and add another interpreter.
Trevor Grant
On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <
Sorry for the double email, but are you thinking visualization should be a library internal to Mahout, or should we leverage Zeppelin's visualization capabilities?
Also, should we move this discussion to dev?
tg
Trevor Grant
On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo wrote:
Sorry, to be a little more clear: part of what we're trying to do is to get the new plotting features integrated with Zeppelin. We plan on adding more advanced plotting.
________________________________
Sent: Monday, May 16, 2016 5:04:49 PM
To: Pat Ferrel; Trevor Grant
Cc: Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
Awesome!
Most of the hard work was done by Dmitriy; I've just reworked it a couple of times to keep up with Spark's refactoring.
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
For the new plotting features that we're working on.
The plotting is still a work in progress, and the grid and surface plots are not working properly. The plots are Swing-based and can currently be exported as PNGs. There are a few examples on the closed PR:
https://github.com/apache/mahout/pull/230
There is an example script in examples/bin/spark-shell-plot.mscala:
https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
Thanks!
________________________________
Sent: Monday, May 16, 2016 4:54:15 PM
To: Trevor Grant
Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
Subject: Re: Intro - Future Mahout - Zeppelin work
This is only the beginning. Andy has been using Smile as a visualization lib since it is pretty rich in ML support. We are looking at integrating some of that with Zeppelin, then adding code to feed the new visualizations in Mahout. I’m here because I’m fairly familiar with AngularJS, if that’s the way to go. Smile is Swing-based but can output PNGs, maybe other image formats. Andy?
BTW Dmitriy is still very involved but has trouble getting permission to donate code.
On May 16, 2016, at 1:45 PM, Trevor Grant wrote:
Hey Andrew,
Thanks, you basically did all of the hard work for me!
I've got the linear regression example working:
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
My Java is sketchy at best; I tend to over-import. I pulled in the following:
org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
I think those are all necessary... should I be pulling in more?
I hate to say it (but will do so because this isn't public): this integration is super easy from a user perspective, almost too easy. E.g., why not let the user add it themselves? Add the appropriate Maven artifacts, restart the interpreter, and run the following:
```
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap Zeppelin's SparkContext (sc) in a Mahout distributed context.
// The left-hand side of this declaration was cut off in the archived message;
// "implicit val sdc" is an assumed reconstruction.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```
Then whatever code you want and you're off to the races...
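As a rough illustration of "whatever code you want", here is a minimal sketch of an ordinary least squares fit in the Samsara DSL, loosely following the linear regression walkthrough on the play-with-shell page. It assumes a Zeppelin Spark paragraph where `sc` already exists and the Mahout jars listed above are on the interpreter classpath; the toy data and the names `sdc`, `drmData`, `drmX`, and `beta` are made up for illustration and are not from the thread.

```scala
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Assumption: `sc` is the SparkContext that Zeppelin provides.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)

// Toy data (hypothetical): columns are bias, x, y, generated from y = 1 + 2x.
val drmData = drmParallelize(dense(
  (1.0, 1.0, 3.0),
  (1.0, 2.0, 5.0),
  (1.0, 3.0, 7.0),
  (1.0, 4.0, 9.0)), numPartitions = 2)

// Split features (first two columns) from the target (last column).
val drmX = drmData(::, 0 until 2)
val y = drmData.collect(::, 2)

// Normal equations: the products run distributed, the small results are collected.
val XtX = (drmX.t %*% drmX).collect
val Xty = (drmX.t %*% y).collect(::, 0)

// Solve the small in-core system (X'X) beta = X'y.
val beta = solve(XtX, Xty)
println(beta)
```

With this exactly linear toy data, `beta` should come out close to (1.0, 2.0). The resulting vector is the kind of small in-core result that could then be handed to whatever plotting route is chosen.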
That said, adding a build profile like -PsparkMahout and creating an interpreter like %spark.mahout should be fairly straightforward.
Second question: do you have an example that would be more 'visualization friendly'? I could pass the results to Angular or R just to show off how to do it.
Which leads back to the question: is this even worth building a full interpreter for, or should we just make a really nice blog post with examples on how to integrate with R...?
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo wrote:
Hi Trevor, welcome!
It's great to have you helping out, thanks very much. I've done a good amount of work on our Mahout Spark shell, so let me know if you have any questions about what we did there.
Thanks a lot!
Andy
-------- Original message --------
Date: 05/16/2016 2:44 PM (GMT-05:00)
Pat, Andrew
Subject: Re: Intro - Future Mahout - Zeppelin work
Oh yes, he's around. I see him online.
On Mon, May 16, 2016 at 2:42 PM, Trevor Grant wrote:
Is Dmitriy Lyubimov still around?
Looks like he created this issue for Zeppelin a while ago. (The old lost code to which you were referring?)
https://issues.apache.org/jira/browse/ZEPPELIN-116
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi wrote:
Welcome to the party TG !!
On Mon, May 16, 2016 at 2:28 PM, Trevor Grant wrote:
Hey all,
I'm excited for a chance to help out. I'm actually getting ready to download now and start playing around.
I had talked about this briefly, but given a properly functioning Zeppelin interpreter for Apache Mahout, one could leverage all of the Zeppelin visualizations, anything in AngularJS, or anything in R (through clever use of Zeppelin's Resource Pools).
I'll work on getting logged in to the Slack channel as well.
Nice to meet you all, looking forward to helping out!
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
"Fortunate is he, who is able to know the causes of things." -Virgil
On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi wrote:
FYI...
Trevor was there for my talk, so he has some idea of Mahout Samsara.
On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel wrote:
Hey Trevor,
Good to meet you. As you probably know, Mahout-Samsara is a reincarnation of the project in a new body, which is less a collection of algorithms than a roll-your-own math/algorithm tool. The major benefit is that during experimentation and later in production the code is by nature scalable on Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math, but we are now looking at streaming online algo support too.
In any case, you probably know we have a Mahout version of the Spark Shell, which has been integrated with an old version of Zeppelin (code is lost). Recently Andy has experimented with some very nice visualizations of ML data (not just analytics data). We as a project are interested in Zeppelin integration of our shell and graphics. From what I understand, the graphics extension mechanism of Zeppelin is based on AngularJS, which I have some experience with.
So, we’d like to start the conversation about how to proceed. We would love some help but will move ahead in any case.
Pat
On May 15, 2016, at 9:52 AM, Suneel Marthi wrote:
Hi Trevor,
Nice meeting u last week in Vancouver. Per our conversation, I wanted to introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
As I mentioned in my talk, we are actively looking at Zeppelin integration with Mahout (primarily for Spark) and would appreciate your help (as also all things DL and ML).
We definitely can use all your help as we r revamping the Mahout project and shedding its legacy MapReduce image.
I sent u an invite to the Mahout Slack channel, mahout.apache.org <http://mahout.apache.org/>. That's where we all hang out, without having to worry about avoiding naughty words.
Looking forward to working with you
Suneel