Suneel Marthi
2016-05-17 14:01:09 UTC
Thanks Simone for pointing this out.
On the Apache Mahout project we have distributed linear algebra with R-like
semantics that can be executed on Spark/Flink/H2O.
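For a flavour of the R-like semantics, here is a minimal sketch in the Samsara Scala DSL (drawn from the published Samsara examples for Mahout 0.12.x; exact imports and the context helper may differ slightly between versions):

    // Minimal Samsara sketch, assuming the 0.12.x Scala DSL; imports follow
    // the published examples and may vary by version.
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // An implicit distributed context must be in scope, e.g. on Spark:
    //   import org.apache.mahout.sparkbindings._
    //   implicit val ctx = mahoutSparkContext(masterUrl = "local[*]", appName = "samsara")

    // Build an in-core matrix, then wrap it as a distributed row matrix (DRM).
    val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
    val drmA = drmParallelize(inCoreA, numPartitions = 2)

    // R-like expression A' %*% A, optimized and executed on the backend
    // (Spark/Flink/H2O), then collected back in-core.
    val inCoreAtA = (drmA.t %*% drmA).collect

The same expression runs unchanged on any of the supported backends; only the distributed context changes.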
@Kam: the document you point out is old and outdated; the most up-to-date
reference to the Samsara API is the book "Apache Mahout: Beyond
MapReduce". (Shameless marketing here on behalf of fellow committers :) )
We added the Flink DataSet API in the recent Mahout 0.12.0 release (April 11,
2016), and it was called out in my talk at ApacheBigData in Vancouver last
week.
The Mahout community would definitely be interested in being involved with
this and sharing notes.
IMHO, the focus should first be on building a good linalg foundation
before embarking on building algos and pipelines. Adding @dlyubimov to this.
---------- Forwarded message ----------
From: Simone Robutti <***@radicalbit.io>
Date: Tue, May 17, 2016 at 9:48 AM
Subject: Fwd: machine learning API, common models
To: Suneel Marthi <***@apache.org>
---------- Forwarded message ----------
From: Kavulya, Soila P <***@intel.com>
Date: 2016-05-17 1:53 GMT+02:00
Subject: RE: machine learning API, common models
To: "***@beam.incubator.apache.org" <***@beam.incubator.apache.org>
Thanks Simone,
You have raised a valid concern about how different frameworks will have
different implementations and parameter semantics for the same algorithm. I
agree that it is important to keep this in mind. Hopefully, through this
exercise, we will identify a good set of common ML abstractions across
different frameworks.
Feel free to edit the document. We had limited the first pass of the
comparison matrix to the machine learning pipeline APIs, but we can extend
it to include other ML building blocks like linear algebra operations, and
APIs for optimizers like gradient descent.
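To make the optimizer idea concrete, a small, purely illustrative in-memory Scala sketch of batch gradient descent for least squares (none of these names come from an existing API; a real building block would run over a distributed collection):

    // Illustrative only: batch gradient descent for least squares.
    // Gradient of (1/2n)||Xw - y||^2 is X'(Xw - y) / n; one step per iteration.
    object GradientDescentSketch {
      def step(xs: Array[Array[Double]], ys: Array[Double],
               w: Array[Double], lr: Double): Array[Double] = {
        val n = xs.length
        val grad = new Array[Double](w.length)
        for (i <- xs.indices) {
          val err = xs(i).zip(w).map { case (x, wi) => x * wi }.sum - ys(i)
          for (j <- w.indices) grad(j) += err * xs(i)(j) / n
        }
        w.zip(grad).map { case (wi, g) => wi - lr * g }
      }

      def fit(xs: Array[Array[Double]], ys: Array[Double],
              lr: Double = 0.1, iters: Int = 100): Array[Double] =
        (1 to iters).foldLeft(Array.fill(xs.head.length)(0.0)) { (w, _) =>
          step(xs, ys, w, lr)
        }
    }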
Soila
-----Original Message-----
From: Kam Kasravi [mailto:***@gmail.com]
Sent: Monday, May 16, 2016 8:22 AM
To: ***@beam.incubator.apache.org
Subject: Re: machine learning API, common models
Thanks Simone - yes I had read your concerns on dev and I think they're
well founded.
Thanks for the Samsara reference - I've been looking at the Spark/Scala
bindings: http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
I think we should expand the document to include linear algebraic ops, or at
least pay due diligence to it. If you're doing anything on the Flink side
in this regard, let us know, or feel free to suggest edits/updates to the document.
Thanks
Kam
On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
Hello,
I'm Simone and I just began contributing to Flink ML (actually on the
distributed linalg part). I already expressed my concerns about the
fact that different implementations produce different results and may
vary in quality. Also, the semantics of parameters may change from one
implementation to the other. This could hinder portability and
transparency. I believe these problems could be handled by paying due
attention to the details of every single implementation, but I invite
you not to underestimate them.
On the other hand, the API in itself looks good to me. From my side, I
hope to fill some of the gaps in Flink that you underlined in the comparison
matrix.
Talking about matrices, proper matrices this time: I believe it would
be useful to include support for linear algebra operations in this API.
Something similar is already present in Mahout's Samsara and it looks
really good, but clearly a similar implementation on Beam would be far
more interesting and powerful.
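A rough local sketch of what the core of such an implementation would look like (the Seq stands in for whatever distributed collection the runner exposes, and every name here is made up for illustration): hold the matrix as (row index, row) pairs and compute A'A as the sum of the outer products of the rows, which is exactly the shape of a map/combine step on a distributed backend.

    // Local stand-in sketch: "distributed" matrix = collection of (rowIndex, row).
    // A'A = sum over rows r of the outer product r r', so it maps and combines
    // cleanly over any distributed collection.
    type Row = Array[Double]

    def ata(rows: Seq[(Long, Row)], numCols: Int): Array[Array[Double]] = {
      val acc = Array.fill(numCols, numCols)(0.0)
      for ((_, r) <- rows; i <- 0 until numCols; j <- 0 until numCols)
        acc(i)(j) += r(i) * r(j)
      acc
    }

    val a = Seq(0L -> Array(1.0, 2.0), 1L -> Array(3.0, 4.0))
    val ataResult = ata(a, numCols = 2)   // [[10, 14], [14, 20]]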
My 2 cents,
Simone
Hi Tyler,
Thank you so much for your feedback. I agree that starting with the
high-level API is a good direction. We are interested in Python
because it
is the language that our data scientists are most familiar with. I
think starting with Java would be the best approach, because the
Python API can be a thin wrapper for the Java API.
In Spark, the Scala, Java and Python APIs are identical. Flink does
not have a Python API for ML pipelines at present.
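For reference, the Pipeline/Estimator pattern those bindings share looks roughly like this in Scala (a standard minimal example; entry points and details vary by Spark version):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("pipeline-sketch").getOrCreate()

    // Toy training data: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "spark ml pipelines", 1.0),
      (1L, "unrelated text", 0.0)
    )).toDF("id", "text", "label")

    // Chain Tokenizer -> HashingTF -> LogisticRegression into a single Estimator.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() yields a PipelineModel whose transform() scores new data.
    val model = pipeline.fit(training)

The Java and Python versions mirror this stage by stage, which is what makes the three APIs effectively identical.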
Could you point me to the updated runner API?
Soila
-----Original Message-----
Sent: Friday, May 13, 2016 6:34 PM
Subject: Re: machine learning API, common models
Hi Kam & Soila,
Thanks a lot for writing this up. I ran the doc past some of the
folks who've been doing ML work here at Google, and they were
generally happy with the distillation of common methods in the doc.
I'd be curious to hear
what folks on the Flink- and Spark- runner sides think.
To me, this seems like a good direction for a high-level API.
Presumably, once a high-level API is in place, we could begin
looking at what it would
take to add lower-level ML algorithm support (e.g. iterative) to the
Beam Model. Is this essentially what you're thinking?
- Presumably you'd want to tackle this in Java first, since that's the
only language we currently support? Given that half of your
examples are in
Python, I'm also assuming Python will be interesting once it's
available.
- Along those lines, what languages are represented in the capability
matrix? E.g. is Spark ML support as detailed there identical across
Java/Scala and Python?
- Have you thought about how this would tie in at the runner level,
particularly given the updated Runner API changes that are coming?
I'm assuming they'd be provided as composite transforms that (for
now) would
have no default implementation, given the lack of low-level
primitives for
ML algorithms, but am curious what your thoughts are there.
- I still don't fully understand how incremental updates due to model
drift would tie in at the API level. There's a comment thread in
the doc
still open tracking this, so no need to comment here additionally.
Just pointing it out as one of the things that stands out as
potentially having
API-level impacts to me that doesn't seem 100% fleshed out in the
doc yet
(though that admittedly may just be my limited understanding at
this point
:-).
-Tyler
Hi Tyler - my bad. Comments should be enabled now.
On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau
Thanks a lot, Kam. Can you please enable comment access on the doc?
I seem
to have view access only.
-Tyler
On Fri, May 13, 2016 at 9:54 AM Kam Kasravi
Hi
A number of readers have made comments on this topic recently.
We have created a document that does some analysis of common
ML models and related
APIs. We hope this can drive an approach that will result in
an API, compatibility matrix and involvement from the same
groups that are implementing transformation runners (spark, flink, etc).
We welcome comments here or in the document itself.
https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing