Jim, let me start by saying it's an honor (and an unexpected one, on my
side). Are you willing to get hands-on with numerical problems at this
point (or do you have resources that can get hands-on)?
A short modern Mahout story (as short as I can make it):
Most nagging problem: lack of support from industry and/or academia. We
have capable committers but less capable backers in terms of willingness
to sanction contributions.
Current Mahout development goes two ways: (a) the platform (aka `samsara`);
and (b) useful, preferably end-to-end use case scenarios, or just
methodology implementations. Note that while (b) is intended to use (a)
(and gain backend portability as a bonus), this is not strictly required as
long as the backend-specific code can be ported fairly easily to other
backends. Still, when we come across a need for custom code, we try to
analyze whether it might be a fairly common abstraction, so that we can add
it to the list of formalisms in the platform and avoid repetition in the
future. A platform primer can be found on the site; I won't get into that
now.
In the platform, problem #1 currently is performance. Not that it is
generally bad, but some pieces are limited by the backends. We did some
in-memory work to integrate better-performing backends there, but the
effort is constrained by our immediate capacity to contribute, and the most
glaring issue (as one visitor duly noted in jira) is that the distributed
backends we are trying to run are severely limited when it comes to
interconnected algebraic problems. We have ideas about what to do here,
though.
It is precisely the distributed performance of the current backends (flink,
spark) on interconnected numerical problems that precludes Mahout from
being a practical platform for implementing deep learning at scale, for
example. I suppose in-memory performance should be ok for that purpose once
we have added GPU support and DL-specific GPU primitives. The in-memory
improvements do not yet cover everything that would be ideal, but there
has been some notable progress there.
With methodologies, well, there's no single most pressing problem; it is
really just defined by whatever practical problem one has at hand.
Currently, Trevor does most of this outstanding work. Ideally, what we
offer should simply be more edgy than what most distributed packages offer.
E.g., decent-to-good Bayesian optimization for hyperparameters; or, say, I
have been suggesting for a few years to experiment with LRFM recommendation
techniques, as they significantly expand the types of predictors the method
can take, and their treatment, compared to things like COO or implicit
feedback behavior-based recommenders. Another example: there's no good
coverage in clustering in terms of _type_ of clustering -- mixtures,
density, spectral, not just traditional centroid-type methods.
Visualization techniques, even as simple as 2d density estimators for big
datasets, are also in demand. Generally speaking, industry has moved far
ahead of what is commonly available in open source software in terms of
visualization approaches. Bottom line, the only guidance I see here is --
"don't be trivial. Seek a unique value proposition". But the main guiding
principle so far has been people's pragmatism: "I have an actual production
use case and/or very specific requirements for it, I want to use
methodology X for that, and I don't seem to be able to find it elsewhere
under management of a distributed platform Y".
-d
Post by Jim Jagielski
Post by Suneel Marthi
Curious JimJag,
Did some dude from CapitalOne poke u about Mahout
Not really, no...