Intro from a lurker

Jim Jagielski

2017-02-09 04:18:39 UTC

Hello there!

I've been following Mahout and lurking for a long, long time.
I never really had the cycles to get involved, but I always
found it a very interesting project, and the whole ML area
as fascinating as well...

Anyway, I'm looking for some new projects to basically
help out with and have fun with, and Mahout is on my short
list. So before I try "jumping in" I thought I'd send out
a quick "Hello", see if there are things that people would
encourage me to "look at first" and just give fair warning. :)

Cheers!

Suneel Marthi

2017-02-09 04:50:18 UTC

Curious JimJag,
Did some dude from CapitalOne poke u about Mahout? I was doing a talk to
some CapOne and other DC Big Data folks in Vienna VA some 2-3 weeks back
and pretty much lost the entire crowd when I got to explaining column-row
matrix multiplication. ☺
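
For readers who have not met the term: in the column-row view, the product A*B is the sum over k of the outer product of column k of A with row k of B. The block below is only a minimal plain-Python sketch of that idea, not Mahout code; the function name `matmul_column_row` is made up for illustration.

```python
def matmul_column_row(a, b):
    """Multiply a (m x n) by b (n x p) as a sum of n outer products."""
    m, n, p = len(a), len(b), len(b[0])
    if any(len(row) != n for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(m)]
    for k in range(n):
        col = [a[i][k] for i in range(m)]  # k-th column of a
        row = b[k]                         # k-th row of b
        # Accumulate the outer product col * row into the result.
        for i in range(m):
            for j in range(p):
                c[i][j] += col[i] * row[j]
    return c


product = matmul_column_row([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each outer product depends only on one column of A and the matching row of B, which is why this formulation comes up when the two operands are partitioned differently.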

Jim Jagielski

2017-02-09 14:34:52 UTC

Post by Suneel Marthi
Curious JimJag,
Did some dude from CapitalOne poke u about Mahout

Not really, no...

Dmitriy Lyubimov

2017-02-09 18:48:36 UTC

Jim, let me start by stating it's an (unexpected on my side) honor. Are you
willing to get hands-on at this point in numerical problems (or have
resources that can get hands-on)?

Short modern Mahout story (as short as it is possible to be short)

Most nagging problem: lack of support by industry and/or academia. We have
capable committers but less capable backers in terms of willingness to
sanction contributions.

Current Mahout development goes 2 ways: (a) the platform (aka `samsara`);
and (b) useful, preferably end-to-end use case scenarios, or just
methodology implementation. Note that while (b) is intended to use (a) (and
gain backend portability as a bonus), it is not strictly required as long
as the backend-specific code could be fairly easily ported to other
backends. Still, if we come across a need for custom code, we try to
analyze whether it is something that might be a fairly common abstraction,
so we could add it to the list of formalisms we have in the platform and
avoid repetition in the future. A platform primer can be found on the
site; I won't be getting into that now.

In the platform, problem #1 currently is performance. Not that it is
generally bad, but some pieces are limited by back-ends. We did some
in-memory work to integrate better-performing backends there, but the
effort is constrained by our immediate capacity to contribute, and the most
glaring issue (as one visitor duly noted in JIRA) is that the distributed
backends we are trying to run on are severely limited in terms of
interconnected algebraic problems. We have ideas about what to do here,
though.

It is precisely the distributed performance of the current backends (Flink,
Spark) on interconnected numerical problems that precludes Mahout from
being a practical platform for implementing deep learning at scale, for
example. I suppose in-memory performance should be ok for that purpose once
we have added GPU and DL-specific GPU primitives. The in-memory
improvements are not complete for everything that would be ideal, but there
has been some notable progress there.

With methodologies, well, there's no single most pressing problem; it is
really just defined by the practical problem one has at hand. Currently,
Trevor does most of this outstanding work. Preferably, it should simply be
edgier than what most distributed packages offer.

E.g., decent-to-good Bayesian optimization for hyperparameters. Or, say,
LRFM recommendation techniques, which I have been suggesting we experiment
with for a few years, as they significantly expand the types of predictors
the method can take, and their treatment, compared to things like COO or
implicit feedback behavior-based recommenders. Another example: there's no
good coverage in clustering in terms of _type_ of clustering -- mixtures,
density, spectral, not just traditional centroid-type methods.
Visualization techniques, even as simple as 2d density estimators for big
datasets, are also in demand. Generally speaking, industry has moved far
ahead of what is commonly available in open source software in terms of
visualization approaches. Bottom line, the only guidance I see here is:
"don't be trivial. Seek unique value proposition". But the main guiding
principle so far has been people's pragmatism: "I have an actual production
use case and/or very specific requirements for it, I want to use
methodology X for it, and I don't seem to be able to find it elsewhere
under management of a distributed platform Y".

-d
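
To make the "2d density estimator for big datasets" item above concrete, here is a minimal sketch, assuming the simplest possible estimator: a fixed-grid 2D histogram normalised so it integrates to 1 over the grid. It is plain Python rather than Mahout code, and the function name `density_2d` and its parameters are hypothetical.

```python
def density_2d(points, x_range, y_range, bins):
    """Estimate a 2D density on a bins x bins grid by counting points per cell."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    dx = (x_max - x_min) / bins
    dy = (y_max - y_min) / bins
    counts = [[0] * bins for _ in range(bins)]
    total = 0
    for x, y in points:  # single streaming pass over the data
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the grid
        # Clamp so that points on the upper edge fall into the last cell.
        i = min(int((x - x_min) / dx), bins - 1)
        j = min(int((y - y_min) / dy), bins - 1)
        counts[i][j] += 1
        total += 1
    if total == 0:
        return [[0.0] * bins for _ in range(bins)]
    cell_area = dx * dy
    # Normalise counts so the density integrates to 1 over the grid.
    return [[c / (total * cell_area) for c in row] for row in counts]


grid = density_2d([(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (5.0, 5.0)],
                  (0.0, 1.0), (0.0, 1.0), bins=2)
```

Because the per-cell counts are additive, counts computed on separate partitions of a large dataset can simply be summed before the final normalisation step.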

Andrew Musselman

2017-02-09 18:59:10 UTC

Ditto, thanks for reaching out Jim; grateful for your offer. We are cutting
a 0.13 release in the next couple of weeks and I know we could use help
testing/signing/etc.

Best
Andrew

Jim Jagielski

2017-02-10 12:06:48 UTC

Wow... I don't think I've EVER encountered a welcome like this!

Thanks for all the info and pointers... I plan to dig in over the
weekend and really digest the emails and see where I can make some
immediate (or semi-immediate ;) ) contributions.

Cheers!

Andrew Musselman

2017-02-11 04:55:14 UTC

Sounds good, thanks. Happy to invite you to prep and release chats if you'd
like; let us know.

Jim Jagielski

2017-02-17 13:57:07 UTC

Yes, please! Thx!

Saikat Kanjilal

2017-02-20 17:13:50 UTC

@AndrewM,

I had some ideas on MAHOUT-1894 and tying that into a set of perf tests
that would be useful for Mahout. I'd also like to get a general
understanding of all the issues for the current release. Is it possible to
attend the prep and release chats, or are those limited to only a few
people? If not, I'd love a summary of current issues/blockers so that I can
help.

Thanks in advance.

Trevor Grant

2017-02-21 13:41:12 UTC

Hey Saikat-

Please drop any thoughts you have on 1894 on the JIRA ticket or the PR
(that will make it easier to track the conversation).

Issues for the current release can be found here:
https://issues.apache.org/jira/browse/MAHOUT-1907?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%200.13.0%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC%2C%20created%20ASC

Everything on that list is what we would like to get into 0.13.0, and not
much else.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

Trevor Grant

2017-02-09 04:55:10 UTC

Hey Jim!

Would love to have you help out.

A first step might be familiarizing yourself a bit with Samsara, the Mahout
R-like DSL:
https://mahout.apache.org/users/environment/in-core-reference.html
https://mahout.apache.org/users/environment/out-of-core-reference.html

A good place to start would be looking at the JIRA board for 'beginner'
issues https://issues.apache.org/jira/browse/MAHOUT-1930?filter=12339671

The new algorithms framework is in place,
https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms

It is pretty sparse right now (read "ripe with opportunity"). If you want
something a bit more challenging and there is a particular method you are
familiar with, you might try your hand at implementing it in that
framework.

Other than that- look around the code base / JIRA boards. If there is
something that strikes you, give it a shot- we're happy to help!

And finally, I've seen you around the mailing lists on a couple of other
projects. If you happen to know anything about migrating our website from
CMS to Jekyll, that would be a huge help too (though not as machine-learny).

Trevor Grant