Jim,
if ALS is of interest, and as far as weighted ALS is concerned (since we
already have a trivial regularized ALS in the "decompositions" package),
here's an uncommitted Samsara-compatible patch from a while back:
https://issues.apache.org/jira/browse/MAHOUT-1365
It puts weights on both the data points (a.k.a. "implicit feedback" ALS) and
the regularization rates (paper references are given). We combine both
approaches in one (which is novel, I guess, yet simple enough).
Obviously the final solver can also be used with regularization-rate
weighting only, if wanted, making it equivalent to one of the papers.
You may know the implicit feedback paper from MLlib's implicit ALS, but
unlike what was done over there (a use-case-specific problem that takes its
input before features are even extracted), we split the problem into a pure
algebraic solver (double-weighted ALS math) and leave the feature extraction
outside of this issue per se (it can be added as a separate adapter).
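To make the "pure algebraic solver" idea concrete, here is a minimal sketch
of one row solve in plain Python. It is not the code from the patch. It
assumes my reading of the two papers: confidences c_i weight the squared
errors, and the base reg rate is scaled by the number of nonzero preferences
in the row, so the row solution is x = (Y' C Y + base_reg * n_u * I)^-1 Y' C p.
The names solve_row, base_reg and n_u are made up for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def solve_row(item_factors, prefs, conf, base_reg):
    """One user's half-step: minimize
    sum_i conf[i] * (prefs[i] - x . y_i)^2 + base_reg * n_u * |x|^2.
    """
    k = len(item_factors[0])
    # Assumption: the regularization weight is the count of nonzero preferences.
    n_u = sum(1 for p in prefs if p != 0)
    a = [[0.0] * k for _ in range(k)]  # accumulates Y' C Y
    b = [0.0] * k                      # accumulates Y' C p
    for y, p, c in zip(item_factors, prefs, conf):
        for i in range(k):
            b[i] += c * p * y[i]
            for j in range(k):
                a[i][j] += c * y[i] * y[j]
    for i in range(k):
        a[i][i] += base_reg * n_u
    return solve(a, b)
```

The solver only sees preferences and confidences; how those were derived
from raw observations stays outside, which is the split described above. The
item-side half-step is the same function with the roles of users and items
swapped.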
The reason is that a use-case-specific implementation does not necessarily
leave room for feature extraction that differs from the use case described
in the paper, namely partially consumed streamed videos. (E.g., instead of
videos one could count visits, clicks, or add-to-cart events, each of which
may need an additional hyperparameter fitted as part of feature extraction
and of converting observations into "weights".)
The biggest problem with these ALS methods, however, is that all
hyperparameters require multidimensional cross-validation and optimization.
I think I mentioned it before in the list of desired solutions: as it
stands, Mahout does not have a hyperparameter fitting routine.
In practice, when using this kind of ALS, we face multidimensional
hyperparameter optimization. One hyperparameter comes from the fitter (the
reg rate, or the base reg rate in the case of weighted regularization), and
the others come from the feature extraction process. E.g., the original
paper introduces (at least) 2 formulas to extract confidence weights from
the streaming video observations, and each of them has one parameter, alpha,
which in the context of the whole problem effectively becomes yet another
hyperparameter to fit. In other use cases, where your confidence measurement
may come from different sources and observations, the confidence extraction
may have even more hyperparameters to fit than just one. And in a
multidimensional case, simple approaches (like grid or random search) become
either cost-prohibitive or ineffective, due to the curse of dimensionality.
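For illustration, the two confidence formulas and the cost of a naive grid
can be sketched as below. The formulas are written as I remember them from
the implicit feedback paper (the log variant, as I recall, also carries a
scale eps, which would be one more knob); the function names and the
grid_evaluations helper are mine, not anything in Mahout.

```python
import math


def linear_confidence(r, alpha):
    """Confidence grows linearly with the raw observation r."""
    return 1.0 + alpha * r


def log_confidence(r, alpha, eps):
    """Confidence grows logarithmically with r; eps sets the scale of r."""
    return 1.0 + alpha * math.log(1.0 + r / eps)


def grid_evaluations(points_per_axis, dims):
    """Number of full ALS fits a dense grid search needs."""
    return points_per_axis ** dims
```

With 10 grid points per axis, one hyperparameter costs 10 fits while four
cost grid_evaluations(10, 4) = 10000 fits, each of them a complete
cross-validated ALS run.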
At the time I was contributing that method, I was using it in conjunction
with a multidimensional Bayesian optimizer, but the company that I wrote it
for had not approved the optimizer for contribution (unlike the weighted
ALS) at that time.
Anyhow, perhaps you could read the algebra in both ALS papers there and ask
questions, and we could worry about hyperparameter optimization and
performance a bit later.
On the feature extraction front (as in implicit feedback ALS per Koren
et al.), this is an ideal use case for a more general R-like formula
approach, which is also on the desired list of things to have.
So I guess we really have 3 problems here:
(1) double-weighted ALS
(2) Bayesian optimization and cross-validation in an n-dimensional
hyperparameter space
(3) feature extraction per (preferably R-like) formula.
-d
+1 to GLMs
-------- Original message --------
Date: 02/17/2017 6:56 AM (GMT-08:00)
Subject: Re: Contributing an algorithm for samsara
Jim is right, and I would take it one step further and say it would be best
to implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model);
from there, logistic regression is a trivial extension.
Buyer beware: GLMs will be a bit of work. Doable, but that would be jumping
in neck first for both Jim and Saikat...
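As a rough illustration of why logistic regression falls out of a GLM, here
is a minimal plain-Python sketch, not Mahout code. It assumes a canonical
link and unit dispersion, so the log-likelihood gradient is proportional to
X'(y - mu) and only the inverse link changes between models. A real
implementation would more likely use IRLS; fit_glm and its parameters are
hypothetical names.

```python
import math


def fit_glm(rows, targets, inverse_link, rate=0.1, steps=500):
    """Gradient ascent on a GLM log-likelihood with a canonical link."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, targets):
            mu = inverse_link(sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                grad[j] += (y - mu) * v
        # Average the gradient so the step size does not depend on row count.
        weights = [w + rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def identity(eta):
    return eta


def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def fit_linear(rows, targets, **kw):
    """Gaussian family, identity link."""
    return fit_glm(rows, targets, identity, **kw)


def fit_logistic(rows, targets, **kw):
    """Binomial family, logit link: same fitter, different inverse link."""
    return fit_glm(rows, targets, sigmoid, **kw)
```

Once the generic fitter exists, each additional model is a one-line choice
of inverse link.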
MAHOUT-1928 and MAHOUT-1929
https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
in there.
If you have an algorithm you are particularly intimate with, or explicitly
need/want- feel free to open a JIRA and assign to yourself.
There is also a case to be made for implementing the ALS...
1) It's a much better 'beginner' project.
2) Mahout has some world-class recommenders; a toy ALS implementation might
help us think through how the other recommenders (e.g. CCO) will 'fit' into
the framework. That is, ALS would be the toy prototype recommender that
helps us think through building out that section of the framework.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Jim Jagielski
My own thoughts are that logistic regression seems a more "generalized"
and hence more useful algo to be factored in... At least in the
use cases that I've been toying with.
So I'd like to help out with that if wanted...
Post by Saikat Kanjilal
Trevor et al,
I'd like to contribute an algorithm or two in Samsara using Spark, as I
would like to compare and contrast Mahout with R Server for a data science
pipeline / machine learning repo that I'm working on. Looking at the list
of algorithms (https://mahout.apache.org/users/basics/algorithms.html), is
there an algorithm for Spark that would be beneficial for the community? My
use cases would typically be around clustering or real-time machine
learning for building recommendations on the fly. The algorithms I see that
could potentially be useful are: 1) Matrix Factorization with ALS
2) Logistic regression with SVD.
Any thoughts/guidance or recommendations would be very helpful.
Thanks in advance.