Discussion: Location of JARs
Trevor Grant
2016-06-01 14:47:47 UTC
I'm trying to refactor the Mahout dependency from the pom.xml of the Spark
interpreter (adding Mahout integration to Zeppelin).

Assuming MAHOUT_HOME is available, I see that the jars in source build live
in a different place than the jars in the binary distribution.

I'm at the point where I'm trying to come up with a good place to pick up
the required jars while allowing for:
1. flexibility in Mahout versions
2. not writing a huge block of code designed to scan several conceivable
places throughout the file system.

One thought was to put the onus on the user to move the desired jars to a
local repo within the Zeppelin directory.

Wanted to open up to input from users and dev as I consider this.

Is documentation specifying which JARs need to be moved to a specific
directory, and the places you are likely to find them, too much to ask of
users?

Other approaches?

For background, Zeppelin starts a Spark shell and we need to make sure all
of the required Mahout jars get loaded onto the classpath when Spark starts.
The question is where all of these JARs live, relative to the Mahout install.

Thanks for any feedback,
tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Dmitriy Lyubimov
2016-06-01 17:35:47 UTC
I am just going to give you some of the design intent behind the existing code.

As far as I can recollect, the Mahout context gives complete flexibility. You
can control the behavior by various degrees of overriding the default
behavior and doing more or less of the context setup work on your own. (I
assume we are talking specifically about sparkbindings.)

By default, the mahoutSparkContext() helper of the sparkbindings package
tries to locate the jars in whatever MAHOUT_HOME/bin/classpath -spark tells
it. (BTW, this part could be rewritten much more elegantly and robustly with
the scala.sys.process._ capabilities of Scala; it's just that this code is
more than 3 years old now, and I was not deep enough into Scala back then to
know its shell DSL in such detail.)
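
A minimal sketch of what such a rewrite could look like, assuming the script
path and flag described above (MAHOUT_HOME/bin/classpath -spark); this is
not the actual sparkbindings code:

    import scala.sys.process._

    // Assumption: MAHOUT_HOME points at an install whose bin/classpath script
    // prints a path-separator-delimited classpath when given -spark.
    val mahoutHome = sys.env.getOrElse("MAHOUT_HOME",
      sys.error("MAHOUT_HOME is not set"))

    // Run the helper script and capture its stdout.
    val rawClasspath: String = Seq(s"$mahoutHome/bin/classpath", "-spark").!!.trim

    // Split into entries and keep only the jars.
    val mahoutJars: Seq[String] = rawClasspath
      .split(java.io.File.pathSeparator)
      .filter(_.endsWith(".jar"))
      .toSeq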

The logic of MAHOUT_HOME/bin/classpath -spark is admittedly pretty
convoluted, and there are location variations between the binary
distribution and maven-built source trees. I can't say I understand the
underlying structure, or the motivation for that structure, very well.

(1) E.g. you can tell it not to add these jars to the context automatically
and instead use your own algorithm to locate them (e.g. in the Zeppelin home
or something). You can also do this in more than one way (a sketch follows
below):
(1a) set addMahoutJars = false. The correct behavior should then be to ignore
the MAHOUT_HOME requirement; the necessary Mahout jars can subsequently be
supplied from your custom location via the `customJars` parameter;
(1b) or you can also set addMahoutJars = false and add them via the supplied
custom sparkConf (which is the base configuration for everything, before
Mahout tries to add its own requirements to the configuration).
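
A rough sketch of (1a)/(1b), using the parameter names mentioned above
(addMahoutJars, customJars, sparkConf). The exact signature of
mahoutSparkContext() and the jar paths are assumptions here and should be
checked against the sparkbindings source for the Mahout version in use:

    import org.apache.mahout.sparkbindings._
    import org.apache.spark.SparkConf

    // Jars staged in some custom location (paths are hypothetical).
    val myMahoutJars = Seq(
      "/opt/zeppelin/local-repo/mahout/mahout-math-0.12.0.jar",
      "/opt/zeppelin/local-repo/mahout/mahout-math-scala_2.10-0.12.0.jar",
      "/opt/zeppelin/local-repo/mahout/mahout-spark_2.10-0.12.0.jar"
    )

    // (1a): skip MAHOUT_HOME-based discovery and hand the jars in directly.
    implicit val sdc = mahoutSparkContext(
      masterUrl     = "local[*]",
      appName       = "zeppelin-mahout",
      customJars    = myMahoutJars,
      sparkConf     = new SparkConf(),  // (1b): or pre-populate spark.jars here
      addMahoutJars = false
    )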

(2) Finally, you can completely take over Spark context creation and wrap an
already existing context into a Mahout context via the implicit (or
explicit) conversion given in the same package, `sc2sdc`. E.g. you can do it
implicitly:

    import org.apache.mahout.sparkbindings._

    // sparkContext is an existing org.apache.spark.SparkContext; the implicit
    // conversion (sc2sdc) wraps it into a Mahout SparkDistributedContext.
    val mahoutContext: SparkDistributedContext = sparkContext

that's it.

Note that in that case you have to take on more work than just adjusting the
context's JAR classpath: you will have to do all the customizations Mahout
makes to the context, such as ensuring the minimum requirements for Kryo
serialization (you can see in the code what is currently enforced, but I
think it is largely just the Kryo serialization requirement).
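
For reference, a hedged sketch of the kind of settings involved. The
registrator class name is my assumption about what sparkbindings registers;
verify against the mahoutSparkContext() source before relying on it:

    import org.apache.spark.SparkConf

    // Assumed values; check what mahoutSparkContext() actually enforces in
    // the sparkbindings source for your Mahout version.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator",
           "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")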

Now, if you want to do a custom classpath: naturally you don't need all the
Mahout jars. In the case of Spark backend execution, you need to filter the
list down to mahout-math, mahout-math-scala and mahout-spark.

I am fairly sure that the modern state of the project also requires
mahout-spark-[blah]-dependency-reduced.jar to be distributed to the backend
as well (it holds the minimum third-party shaded dependencies apparently
used by some algorithms in the backend; it used to be absent from the
backend requirements, though).
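
A small filter sketch along those lines, keyed off the artifact names listed
above (the exact file-name patterns, e.g. of the dependency-reduced jar, vary
by release and are assumptions here):

    // Keep only the backend-relevant Mahout jars from a candidate list.
    val wantedPrefixes = Seq("mahout-math", "mahout-math-scala", "mahout-spark")

    def isBackendJar(fileName: String): Boolean =
      fileName.endsWith(".jar") &&
        (wantedPrefixes.exists(fileName.startsWith) ||
         fileName.contains("dependency-reduced"))

    // e.g.: mahoutJars.filter(p => isBackendJar(new java.io.File(p).getName))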

-d
Dmitriy Lyubimov
2016-06-01 17:46:20 UTC
How does Zeppelin cope with extra dependencies for other interpreters (even
Spark itself)? I guess we should follow the same practice there.

Release independence of the location algorithm largely depends on the jar
filters (again, see the filters in the sparkbindings package). It is
possible that the required artifacts may change, but it is not very likely
(I don't think they have ever changed since 0.10), so it should be possible
to build (Mahout-)release-independent logic to locate, filter and assert the
necessary jars.
Dmitriy Lyubimov
2016-06-01 17:48:44 UTC
PS: this may change soon, though. If/when custom javacpp code is built, we
may want to keep all native things as separate release artifacts, as they
are basically treated as optionally available accelerators and may or may
not be properly loaded in all situations; hence they may warrant a separate
jar vehicle.
Trevor Grant
2016-06-02 15:24:29 UTC
Would you mind having a look at
https://github.com/apache/incubator-zeppelin/pull/928/files
to see if I'm missing anything critical?

The idea is that the user specifies a directory containing the necessary
jars (to be covered in the setup documentation), and the jars are loaded
from there. It also adds some configuration settings (mainly Kryo) when
'spark.mahout' is true. Finally, it imports the Mahout packages and sets up
the sdc from the already declared sc.

Based on my testing, that works in local and cluster mode.
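
Roughly, that sequence amounts to something like the following (a sketch
only, with hypothetical property and directory names, not the actual PR
code):

    import java.io.File
    import org.apache.spark.SparkConf

    // 1. Pick up jars from a user-specified directory (name is hypothetical).
    val mahoutJarDir = "/opt/zeppelin/mahout-jars"
    val jars = new File(mahoutJarDir).listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)

    // 2. Kryo-related settings, applied when 'spark.mahout' is set to true.
    val conf = new SparkConf()
      .setJars(jars)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // 3. Later, in the interpreter session, wrap the already declared sc:
    //      import org.apache.mahout.sparkbindings._
    //      implicit val sdc: SparkDistributedContext = sc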

Thanks,
tg

Dmitriy Lyubimov
2016-06-02 17:23:59 UTC
I already looked. My main concern is that it meddles with the Spark
interpreter code too much, which may create friction with the Spark
interpreter in the future. It may be hard to keep the integration code for
two products coherent in one component (in this case, the same interpreter
class/file). I don't want to put this comment on the Zeppelin discussion,
but internally I think it should be a concern for us.

Is it possible to have a standalone mahout-spark interpreter but use the
same Spark configuration as configured for the Spark interpreter? If yes, I
would very much prefer not to have spark-alone and spark+mahout code
intermingled in the same interpreter class.

Visually, it probably also would be preferable to have a block that would
require boilerplate of something like

%spark.mahout

... blah ....
Trevor Grant
2016-06-02 18:00:32 UTC
I agree, and have been thinking so more and more over the last couple of
days.

I'm going to start tinkering with that idea this afternoon / the remainder
of the week.

Dmitriy Lyubimov
2016-06-02 18:08:53 UTC
Thank you, Trevor, for doing this. I think it is tremendously useful for
this project.