Discussion:
Making it easier to use Mahout algorithms with Apache Spark pipelines
Holden Karau
2017-07-07 21:32:23 UTC
Hi y'all,

Trevor and I had been talking a bit and one of the things I'm interested in
doing is trying to make it easier for the different ML libraries to be used
in Spark. Spark ML has this unified pipeline interface (which is certainly
far from perfect), but I was thinking I'd take a crack at trying to expose
some of Mahout's algorithms so that they could be used/configured with
Spark ML's pipeline interface.
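To make the idea concrete, here's a self-contained toy sketch of the shape involved: Spark ML pipeline stages come in two flavors, estimators (fit returns a fitted model) and transformers (transform maps a dataset to a dataset), and a Mahout algorithm would be wrapped to present that interface. Everything below is plain Scala with made-up names to show the pattern; it is not the real Spark or Mahout API:

```scala
// Toy mirror of Spark ML's Estimator/Transformer pattern (NOT the real API).
// A "dataset" here is just a Seq of dense feature vectors.
object PipelineSketch {
  type Dataset = Seq[Vector[Double]]

  trait Transformer {
    def transform(data: Dataset): Dataset
  }

  trait Estimator {
    def fit(data: Dataset): Transformer
  }

  // Hypothetical wrapper: a Mahout-style mean-centering algorithm exposed
  // through the pipeline-stage interface.
  class MeanCenterEstimator extends Estimator {
    def fit(data: Dataset): Transformer = {
      val n = data.length.toDouble
      val dims = data.head.length
      // Column means learned at fit time...
      val means = Vector.tabulate(dims)(j => data.map(_(j)).sum / n)
      // ...baked into the returned model, applied at transform time.
      new Transformer {
        def transform(d: Dataset): Dataset =
          d.map(row => row.zip(means).map { case (x, m) => x - m })
      }
    }
  }
}

import PipelineSketch._
val data: Dataset = Seq(Vector(1.0, 2.0), Vector(3.0, 4.0))
val model = new MeanCenterEstimator().fit(data)
val centered = model.transform(data)
```

The real work would be implementing Spark ML's actual `Estimator`/`Transformer`/`Params` traits over Mahout's DRM-based algorithms, but the fit-then-transform contract is the same.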

I'd like to take a stab at doing that inside the mahout project, but if
it's something people feel would be better to live outside I'm happy to do
that as well.

Cheers,

Holden

For reference:

https://spark.apache.org/docs/latest/ml-pipeline.html
--
Twitter: https://twitter.com/holdenkarau
Trevor Grant
2017-07-07 22:33:26 UTC
+1 on this.

There's precedent for Spark interoperability with the various drmWrap
functions.

We've discussed pipelines in the past, and whether to roll our own vs.
utilize the underlying engine's. Interoperating with other pipelines
(Spark's) doesn't preclude that.

The goal of the pipeline discussion, iirc, was to eventually get to
automated hyper-parameter tuning. Again, I don't see a conflict - maybe a
way to work that in at some point?

In addition to all of this - I think convenience methods and interfaces for
more advanced Spark operations will make the Mahout learning curve less
steep, and hopefully drive adoption.

The only concern I can think of is version creep - which opens a whole other
discussion on 'how long will we support Spark 1.6' (I'm not proposing we
stop anytime soon) - but as I understand it, a lot of the advanced pipeline
stuff came about in 2.x. I think this can be handled easily - the Spark
interpreter in Apache Zeppelin is rife with multi-version support examples
(1.2 - 2.1).

Also - I don't see this affecting anything outside of the Spark bindings, so
engine neutrality should be maintained (with Spark getting some favorable
treatment, but at this point... we've pushed Flink to its own profile and
we keep h2o around because it's not causing any trouble).
Holden Karau
2017-07-08 00:22:12 UTC
Version creep is certainly an issue; normally it's solved by having a
2.x directory for things that are only supported in 2.x, and only including
that directory in the 2.x build. That said, the pipeline stuff has been
around since 1.3 (albeit as an alpha component), so we could probably make
it work for 1.3+ (though it might make sense to only bother for the 2.x
series, since the rest of the pipeline stages in Spark weren't really well
fleshed out in the 1.x branch).
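If Mahout went that route, one way to wire a version-specific source directory into the Maven build is the build-helper-maven-plugin's add-source goal. A rough sketch - the profile id and directory name here are made up for illustration, not anything in Mahout's pom today:

```xml
<!-- Hypothetical profile: compiles an extra source tree only for Spark 2.x builds -->
<profile>
  <id>spark-2.x</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>build-helper-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>add-spark-2.x-sources</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>add-source</goal>
            </goals>
            <configuration>
              <sources>
                <source>src/main/spark-2.x/scala</source>
              </sources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```

Building with that profile active would pull in the 2.x-only code; the default build would never see it.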
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Andrew Musselman
2017-07-10 00:33:40 UTC
Holden, sounds good to me; the only thing I'd be cautious of is how
dependent we get on that other project, but I don't think it's a big risk.

Thanks!
Holden, great to have you here. This sounds great! Easier
interoperability with Spark and easing the Mahout learning curve are, IMO,
huge priorities.
I am conceptually +1 on this as well (my only minor concerns are with our
goal of preserving engine neutrality as best we can). With the precedent
of Spark getting favorable treatment, as Trevor pointed out, this should not
be much of a problem.
I believe that this could fit into our high-level algorithm framework (in
math-scala):
https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms
It seems so. Keeping pipeline interfaces in a high-level module, dropping
down to the spark module and extending for Spark only (which in this case
would likely be most of the work), and then adding stubs for Flink and h2o
for future developers who may have an interest would be best, IMO.
There is precedent here as well, e.g. `IndexedDataset`s.
profile for h2o for symmetry but that is another discussion.
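For what it's worth, here's a dependency-free sketch of that layering - every name below is hypothetical, chosen only to show the shape (engine-neutral trait in the high-level module, real implementation in the spark module, stubs elsewhere):

```scala
// Hypothetical layering sketch; none of these names exist in Mahout today.

// math-scala (engine-neutral): the abstract pipeline-stage contract.
trait PipelineStageLike {
  def engine: String
  def fit(data: Seq[Double]): Seq[Double]
}

// spark module: a real implementation - where most of the work would live.
class SparkPipelineStage extends PipelineStageLike {
  val engine = "spark"
  // Stand-in for real fitting logic over Spark-backed data.
  def fit(data: Seq[Double]): Seq[Double] = data.map(_ * 2.0)
}

// flink / h2o modules: stubs left for future contributors.
class FlinkPipelineStage extends PipelineStageLike {
  val engine = "flink"
  def fit(data: Seq[Double]): Seq[Double] =
    throw new NotImplementedError("Flink pipeline support: future work")
}

val out = new SparkPipelineStage().fit(Seq(1.0, 2.0))
```

Callers program against the trait, so picking up the Flink stub later wouldn't change any high-level code.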
--andy