Discussion: Location of JARs
Trevor Grant
2016-06-01 14:47:47 UTC
I'm trying to refactor the Mahout dependency from the pom.xml of the Spark
interpreter (adding Mahout integration to Zeppelin).

Assuming MAHOUT_HOME is available, I see that the jars in source build live
in a different place than the jars in the binary distribution.

I'm at the point where I'm trying to come up with a good place to pick up
the required jars while allowing for:
1. flexibility in Mahout versions
2. not writing a huge block of code designed to scan several conceivable
places throughout the file system.

One thought was to put the onus on the user to move the desired jars to a
local repo within the Zeppelin directory.

Wanted to open up to input from users and dev as I consider this.

Is documentation specifying which JARs need to be moved to a specific
directory, and the places you are likely to find them, too much to ask of
users?

Other approaches?

For background, Zeppelin starts a Spark shell and we need to make sure all
of the required Mahout jars get loaded onto the classpath when Spark starts.
The question is where all of these JARs live, relative to the Mahout install.

Thanks for any feedback,
tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Dmitriy Lyubimov
2016-06-01 17:35:47 UTC
I am just going to give you some of the design intent behind the existing code.

As far as I can recollect, the Mahout context gives complete flexibility. You
can control the behavior by various degrees of overriding the default
behavior and doing more or less of the context setup work on your own. (I
assume we are talking specifically about sparkbindings.)

By default, the mahoutSparkContext() helper of the sparkbindings package
tries to locate the jars in whatever MAHOUT_HOME/bin/classpath -spark tells
it. (BTW, this part could be rewritten much more elegantly and robustly with
the scala.sys.process._ capabilities of Scala; it's just that this code is
more than 3 years old now, and I was not deep enough into Scala back then to
know its shell DSL in such detail.)
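
A minimal sketch of what such a rewrite could look like, assuming the script
path and flag described above (MAHOUT_HOME/bin/classpath -spark); this is
not the actual sparkbindings code:

    import scala.sys.process._

    // Assumption: MAHOUT_HOME points at an install whose bin/classpath script
    // prints a path-separator-delimited classpath when given -spark.
    val mahoutHome = sys.env.getOrElse("MAHOUT_HOME",
      sys.error("MAHOUT_HOME is not set"))

    // Run the helper script and capture its stdout.
    val rawClasspath: String = Seq(s"$mahoutHome/bin/classpath", "-spark").!!.trim

    // Split into entries and keep only the jars.
    val mahoutJars: Seq[String] = rawClasspath
      .split(java.io.File.pathSeparator)
      .filter(_.endsWith(".jar"))
      .toSeq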

The logic of MAHOUT_HOME/bin/classpath -spark is admittedly pretty
convoluted, and there are location variations between the binary
distribution and maven-built source trees. I can't say I understand the
underlying structure, or the motivation for that structure, very well.

(1) E.g. you can tell it not to add these jars to the context automatically
and instead use your own algorithm to locate them (e.g. in the Zeppelin home
or something). You can also do this in more than one way (a sketch follows
below):
(1a) set addMahoutJars = false. The correct behavior should then be to ignore
the MAHOUT_HOME requirement; the necessary Mahout jars can subsequently be
supplied from your custom location via the `customJars` parameter;
(1b) or you can also set addMahoutJars = false and add them via the supplied
custom sparkConf (which is the base configuration for everything, before
Mahout tries to add its own requirements to the configuration).
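
A rough sketch of (1a)/(1b), using the parameter names mentioned above
(addMahoutJars, customJars, sparkConf). The exact signature of
mahoutSparkContext() and the jar paths are assumptions here and should be
checked against the sparkbindings source for the Mahout version in use:

    import org.apache.mahout.sparkbindings._
    import org.apache.spark.SparkConf

    // Jars staged in some custom location (paths are hypothetical).
    val myMahoutJars = Seq(
      "/opt/zeppelin/local-repo/mahout/mahout-math-0.12.0.jar",
      "/opt/zeppelin/local-repo/mahout/mahout-math-scala_2.10-0.12.0.jar",
      "/opt/zeppelin/local-repo/mahout/mahout-spark_2.10-0.12.0.jar"
    )

    // (1a): skip MAHOUT_HOME-based discovery and hand the jars in directly.
    implicit val sdc = mahoutSparkContext(
      masterUrl     = "local[*]",
      appName       = "zeppelin-mahout",
      customJars    = myMahoutJars,
      sparkConf     = new SparkConf(),  // (1b): or pre-populate spark.jars here
      addMahoutJars = false
    )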

(2) Finally, you can completely take over Spark context creation and wrap an
already existing context into a Mahout context via the implicit (or
explicit) conversion given in the same package, `sc2sdc`. E.g. you can do it
implicitly:

    import org.apache.mahout.sparkbindings._

    // sparkContext is an existing org.apache.spark.SparkContext; the implicit
    // conversion (sc2sdc) wraps it into a Mahout SparkDistributedContext.
    val mahoutContext: SparkDistributedContext = sparkContext

that's it.

Note that in that case you have to take on more work than just adjusting the
context's JAR classpath: you will have to do all the customizations Mahout
makes to the context, such as ensuring the minimum requirements for Kryo
serialization (you can see in the code what is currently enforced, but I
think it is largely just the Kryo serialization requirement).
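
For reference, a hedged sketch of the kind of settings involved. The
registrator class name is my assumption about what sparkbindings registers;
verify against the mahoutSparkContext() source before relying on it:

    import org.apache.spark.SparkConf

    // Assumed values; check what mahoutSparkContext() actually enforces in
    // the sparkbindings source for your Mahout version.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator",
           "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")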

Now, if you want to do a custom classpath: naturally you don't need all the
Mahout jars. In the case of Spark backend execution, you need to filter the
list down to mahout-math, mahout-math-scala and mahout-spark.

I am fairly sure that the modern state of the project also requires
mahout-spark-[blah]-dependency-reduced.jar to be distributed to the backend
as well (it holds the minimum third-party shaded dependencies apparently
used by some algorithms in the backend; it used to be absent from the
backend requirements, though).
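
A small filter sketch along those lines, keyed off the artifact names listed
above (the exact file-name patterns, e.g. of the dependency-reduced jar, vary
by release and are assumptions here):

    // Keep only the backend-relevant Mahout jars from a candidate list.
    val wantedPrefixes = Seq("mahout-math", "mahout-math-scala", "mahout-spark")

    def isBackendJar(fileName: String): Boolean =
      fileName.endsWith(".jar") &&
        (wantedPrefixes.exists(fileName.startsWith) ||
         fileName.contains("dependency-reduced"))

    // e.g.: mahoutJars.filter(p => isBackendJar(new java.io.File(p).getName))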

-d
Dmitriy Lyubimov
2016-06-01 17:46:20 UTC
How does Zeppelin cope with extra dependencies for other interpreters (even
Spark itself)? I guess we should follow the same practice there.

Release independence of the location algorithm largely depends on the jar
filters (again, see the filters in the sparkbindings package). It is
possible that the required artifacts may change, but it is not very likely
(I don't think they have ever changed since 0.10), so it should be possible
to build (Mahout-)release-independent logic to locate, filter and assert the
necessary jars.
Dmitriy Lyubimov
2016-06-01 17:48:44 UTC
PS: this may change soon, though. If/when custom javacpp code is built, we
may want to keep all native things as separate release artifacts, as they
are basically treated as optionally available accelerators and may or may
not be properly loaded in all situations; hence they may warrant a separate
jar vehicle.
Trevor Grant
2016-06-02 15:24:29 UTC
Would you mind having a look at
https://github.com/apache/incubator-zeppelin/pull/928/files
to see if I'm missing anything critical?

The idea is that the user specifies a directory containing the necessary
jars (to be covered in the setup documentation), and the jars are loaded
from there. It also adds some configuration settings (mainly Kryo) when
'spark.mahout' is true. Finally, it imports the Mahout packages and sets up
the sdc from the already declared sc.

Based on my testing, that works in local and cluster mode.
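
Roughly, that sequence amounts to something like the following (a sketch
only, with hypothetical property and directory names, not the actual PR
code):

    import java.io.File
    import org.apache.spark.SparkConf

    // 1. Pick up jars from a user-specified directory (name is hypothetical).
    val mahoutJarDir = "/opt/zeppelin/mahout-jars"
    val jars = new File(mahoutJarDir).listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)

    // 2. Kryo-related settings, applied when 'spark.mahout' is set to true.
    val conf = new SparkConf()
      .setJars(jars)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // 3. Later, in the interpreter session, wrap the already declared sc:
    //      import org.apache.mahout.sparkbindings._
    //      implicit val sdc: SparkDistributedContext = sc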

Thanks,
tg

Dmitriy Lyubimov
2016-06-02 17:23:59 UTC
I already looked. My main concern is that it meddles with the Spark
interpreter code too much, which may create friction with the Spark
interpreter in the future. It may be hard to keep the integration code for
two products coherent in one component (in this case, the same interpreter
class/file). I don't want to put this comment on the Zeppelin discussion,
but internally I think it should be a concern for us.

Is it possible to have a standalone mahout-spark interpreter but use the
same Spark configuration as configured for the Spark interpreter? If yes, I
would very much prefer not to have spark-alone and spark+mahout code
intermingled in the same interpreter class.

Visually, it probably also would be preferable to have a block that would
require boilerplate of something like

%spark.mahout

... blah ....
Trevor Grant
2016-06-02 18:00:32 UTC
I agree, and have been thinking so more and more over the last couple of
days.

I'm going to start tinkering with that idea this afternoon / the remainder
of the week.

Dmitriy Lyubimov
2016-06-02 18:08:53 UTC
Thank you, Trevor, for doing this. I think it is tremendously useful for
this project.