Discussion:
[DISCUSS] Naming convention for multiple spark/scala combos
Trevor Grant
2017-07-07 15:57:53 UTC
Hey all,

Working on releasing 0.13.1 with multiple spark/scala combos.

AFAIK, there is no 'standard' for publishing against multiple Spark versions (but I may be
wrong; I don't claim expertise here).

One approach is to simply release binaries only for:
Spark-1.6 + Scala 2.10
Spark-2.1 + Scala 2.11

OR

We could do what DL4J does:

org.apache.mahout:mahout-spark_2.10:0.13.1_spark_1
org.apache.mahout:mahout-spark_2.11:0.13.1_spark_1

org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2
org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2

OR

some other option I don't know of.
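
For illustration, if the DL4J-style coordinates above were published, an sbt build could
consume them roughly like this (a sketch only; the version string is the proposed one, not
an existing artifact):

// build.sbt (sketch) -- consuming the proposed DL4J-style coordinates.
// The version string below is the one proposed in this thread, not a published artifact.
scalaVersion := "2.11.8"

libraryDependencies +=
  // %% appends the Scala binary suffix, resolving to mahout-spark_2.11
  "org.apache.mahout" %% "mahout-spark" % "0.13.1_spark_2"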
Dmitriy Lyubimov
2017-07-07 17:24:51 UTC
It would seem the 2nd option is preferable, if doable. Any option that has the most
desirable combinations prebuilt is preferable, I guess. Spark itself also releases tons of
Hadoop-profile binary variations, so I don't have to build one myself.
Holden Karau
2017-07-07 20:24:24 UTC
Trevor looped me in on this since I hadn't had a chance to subscribe to the
list yet (on now :)).

Artifacts from cross-Spark-version building aren't super standardized (and
there are two rather different types of cross-building).

For folks who just need to build for the 1.X and 2.X branches, appending
_spark1 & _spark2 to the version string is indeed pretty common, and the
DL4J folks do something pretty similar, as Trevor pointed out.

The folks over at hammerlab have made some sbt-specific tooling to make
this easier to do on the publishing side (see
https://github.com/hammerlab/sbt-parent ).

It is true that some people build Scala 2.10 artifacts only for the Spark 1.X series and
2.11 artifacts only for the Spark 2.X series and use that to differentiate (I
don't personally like this approach since it is super opaque, and someone
could upgrade their Scala version and then accidentally be using a
different version of Spark, which would likely not go very well).

For folks who need to hook into internals and cross-build against different
minor versions, there is much less of a consistent pattern; personally,
spark-testing-base is released as:

[artifactname]_[scalaversion]:[sparkversion]_[artifact releaseversion]

But this really only makes sense when you have to cross-build for lots of
different Spark versions (which should be avoidable for Mahout).
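
For a concrete instance of that pattern, a spark-testing-base dependency in sbt looks
roughly like the following (a sketch; the exact version numbers here are illustrative):

// build.sbt (sketch) -- the [artifactname]_[scalaversion]:[sparkversion]_[releaseversion]
// pattern as used by spark-testing-base; version numbers are illustrative.
libraryDependencies +=
  "com.holdenkarau" %% "spark-testing-base" % "2.1.0_0.6.0" % Test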

Since you are likely not depending on the internals of different point
releases, I'd think the _spark1 / _spark2 is probably the right way (or
_spark_1 / _spark_2 is fine too).
---------- Forwarded message ----------
Date: Fri, Jul 7, 2017 at 12:28 PM
Subject: Re: [DISCUSS] Naming convention for multiple spark/scala combos
mahout-spark-2.11_2.10-0.13.1.jar
mahout-spark-2.11_2.11-0.13.1.jar
mahout-math-scala-2.11_2.10-0.13.1.jar
i.e. <module>-<spark version>-<scala version>-<mahout-version>.jar
Not exactly pretty... I somewhat prefer Trevor's idea of the DL4J convention.
Trevor Grant
2017-07-07 21:05:34 UTC
So to tie all of this together-

org.apache.mahout:mahout-spark_2.10:0.13.1_spark_1_6
org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2_0
org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2_1

org.apache.mahout:mahout-spark_2.11:0.13.1_spark_1_6
org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2_0
org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2_1

(Will jars compiled against 2.1 dependencies run on 2.0? I assume not, but I
don't know.) (AFAIK, Mahout compiled for Spark 1.6.x tends to work with
Spark 1.6.y, but that's anecdotal.)
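
Assuming that matrix gets published as above, one hedged sketch of how a downstream sbt
build might compute the right version suffix from the Spark version it targets (the
sparkVersion key and the mahoutSuffix helper are illustrative names, not existing Mahout
tooling):

// build.sbt (sketch) -- picking a Mahout artifact out of the proposed matrix.
// "sparkVersion" and "mahoutSuffix" are illustrative, not existing tooling.
val sparkVersion = settingKey[String]("Spark version to build against")

sparkVersion := "2.1.1"
scalaVersion := "2.11.8"  // pairs with the _2.11 artifacts

// 1.6.x -> spark_1_6, 2.0.x -> spark_2_0, 2.1.x -> spark_2_1
def mahoutSuffix(spark: String): String =
  "spark_" + spark.split('.').take(2).mkString("_")

libraryDependencies +=
  "org.apache.mahout" %% "mahout-spark" % ("0.13.1_" + mahoutSuffix(sparkVersion.value))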

A non-trivial motivation here is that we would like all of these available to
tighten up the Apache Zeppelin integration, where the user could have any
number of different Spark/Scala combos going on and we want it to 'just
work' out of the box (which means a wide array of binaries available, to
Dmitriy's point).

I'm +1 on this, and as RM will begin cutting a provisional RC, just to try
to figure out how all of this will work (it's my first time as release
master, and this is a new thing we're doing).

72 hour lazy consensus. (will probably take me 72 hours to figure out
anyway ;) )

If no objections expect an RC on Monday evening.

tg
Pat Ferrel
2017-07-08 05:35:35 UTC
IIRC these all fit sbt's conventions?


Holden Karau
2017-07-07 23:10:10 UTC
Thanks! :)
Welcome!
Sent from my Verizon Wireless 4G LTE smartphone
Andrew Musselman
2017-07-08 00:31:29 UTC
Welcome Holden; how's that release going Trev :)
Trevor Grant
2017-07-10 04:48:51 UTC
So I've been messing with this all night-

Maven really doesn't seem to like this idea of tacking a string onto a
version number. I can make it work, but it's sloppy and really fattens up
the POMs. (We end up with things like 0.13.2-SNAPSHOT-spark_1.6-SNAPSHOT,
or a lot of plugins that I think would eventually work but that I was unable
to wrangle.)

The alternative, Maven's preferred method as far as I can tell, is adding a
classifier.

This gives Maven coordinates of:

org.apache.mahout:mahout-spark_scala-2.10:spark-1.6:0.13.2
org.apache.mahout:mahout-spark_scala-2.11:spark-2.1:0.13.2

The jars come out looking like:
mahout-spark_2.10-0.13.2-spark_1.6.jar
mahout-spark_2.11-0.13.2-spark_2.1.jar

If one were importing it into their POM, it would look like this:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-spark_2.10</artifactId>
  <classifier>spark_1.6</classifier>
  <version>0.13.2</version>
</dependency>
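
For the sbt side (re: Pat's earlier question), the same coordinate would look roughly like
this; a sketch only, since these artifacts aren't published yet:

// build.sbt (sketch) -- sbt equivalent of the Maven <classifier> dependency above.
scalaVersion := "2.10.6"

libraryDependencies +=
  ("org.apache.mahout" % "mahout-spark_2.10" % "0.13.2").classifier("spark_1.6")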

I have a provisional PR out implementing this:
https://github.com/apache/mahout/pull/330

Feel free to respond here or on the PR, but does anyone have any specific
objection to this method? It _seems_ the "maven" way to do things, though I
am not certain that is correct as I have never come across classifiers
before.

From [1], "As a motivation for this element, consider for example a project
that offers an artifact targeting JRE 1.5 but at the same time also an
artifact that still supports JRE 1.4. The first artifact could be equipped
with the classifier jdk15 and the second one with jdk14 such that clients
can choose which one to use." Which seems to be our use case (sort of). I
entirely concede that I may be wrong here.

[1] https://maven.apache.org/pom.html
+1 if so (sbt naming re: pats comment).
Also +1 on Zeppelin integration being non-trivial.
Sent from my Verizon Wireless 4G LTE smartphone