Traits for a Mahout algorithm library.

I was thinking so too. Most ML frameworks are at least loosely based on the
sklearn paradigm. For those not familiar, at a very abstract level:

model1 = new Algo // e.g. K-Means, Random Forest, Neural Net

model1.fit(trainingData)

// then, depending on the goal of the algorithm, you have either (or both) of:
preds = model1.predict(testData) // returns a vector of predictions, one for
                                 // each observation in the testing data

// or sometimes
newVals = model1.transform(testData) // returns a new dataset-like object; this
                                     // makes more sense for things like neural
                                     // nets, or when you're not just predicting
                                     // a single value per observation

In addition to the above, pre-processing operations also have a
transform method, such as:

preprocess1 = new Normalizer

preprocess1.fit(trainingData) // this phase calculates the mean and
                              // variance of the training data set

preprocessedTrainingData = preprocess1.transform(trainingData)
preprocessedTestingData = preprocess1.transform(testingData)

I think this is a reasonable approach because A) it makes sense and B) it is
a standard of sorts across ML libraries (because of A).
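
For anyone who wants something concrete, here is a minimal sketch of the fit/transform split for a pre-processor. It is only an illustration: the class body, the use of plain Scala arrays in place of a real matrix type, and the choice of population standard deviation are all my assumptions, not an existing Mahout API.

```scala
// Hypothetical sketch: a standardizing pre-processor with fit/transform.
// Array[Array[Double]] (rows of observations) stands in for a real matrix type.
final class Normalizer {
  private var means: Array[Double] = Array.empty
  private var stdDevs: Array[Double] = Array.empty

  // Learn per-column mean and (population) standard deviation from training data.
  def fit(data: Array[Array[Double]]): this.type = {
    require(data.nonEmpty, "training data must not be empty")
    val n = data.length.toDouble
    val cols = data.head.indices
    means = cols.map(j => data.map(_(j)).sum / n).toArray
    stdDevs = cols.map { j =>
      val variance = data.map(row => math.pow(row(j) - means(j), 2)).sum / n
      math.sqrt(variance)
    }.toArray
    this
  }

  // Apply the statistics learned in fit; constant columns are only centered.
  def transform(data: Array[Array[Double]]): Array[Array[Double]] =
    data.map { row =>
      row.indices.map { j =>
        val sd = if (stdDevs(j) == 0.0) 1.0 else stdDevs(j)
        (row(j) - means(j)) / sd
      }.toArray
    }
}

object NormalizerDemo {
  def main(args: Array[String]): Unit = {
    val trainingData = Array(Array(1.0, 10.0), Array(3.0, 10.0))
    val testingData = Array(Array(2.0, 10.0))
    val preprocess1 = new Normalizer().fit(trainingData)
    // The testing data is scaled with the training statistics, never its own.
    println(preprocess1.transform(trainingData).map(_.mkString(", ")).mkString(" | "))
    println(preprocess1.transform(testingData).map(_.mkString(", ")).mkString(" | "))
  }
}
```

The point of the split is that the statistics are computed once, on the training set, and then reused unchanged on any later data.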

We have two high-level bucket types, based on what the output is:

Predictors and Transformers

Predictors: anything that returns a single value per observation, i.e.
classifiers and regressors

Transformers: anything that returns a vector per observation
- Pre-processing operations
- Classifiers, in that there is usually a probability vector for each
observation giving the class it belongs to; the 'predict' method then
just picks the most likely class
- Neural nets (though with one small tweak they can be extended to
regression or classification)
- Any unsupervised learning application (e.g. clustering)
- etc.

And so really we have something like:

class LearningFunction:
    def fit()

class Transformer extends LearningFunction:
    def transform()

class Predictor extends Transformer:
    def predict()
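
As actual Scala traits, the hierarchy above might look roughly like the sketch below. All names and signatures here are hypothetical, plain arrays stand in for real vector and matrix types, and for brevity `fit` always takes labels, which an unsupervised transformer would not need. The toy nearest-centroid classifier is only there to show `predict` picking the best entry of the per-class vector returned by `transform`.

```scala
// Hypothetical sketch of the LearningFunction / Transformer / Predictor idea.
object TraitSketch {
  type Vec = Array[Double]
  type Mat = Array[Vec] // one row per observation

  trait LearningFunction {
    def fit(data: Mat, labels: Vec): this.type
  }

  trait Transformer extends LearningFunction {
    // One output vector per observation.
    def transform(data: Mat): Mat
  }

  trait Predictor extends Transformer {
    // One value per observation: the index of the largest transform output.
    def predict(data: Mat): Vec =
      transform(data).map(scores => scores.indexOf(scores.max).toDouble)
  }

  // Toy classifier: labels are assumed to be 0.0, 1.0, ... with no gaps.
  final class NearestCentroid extends Predictor {
    private var centroids: Mat = Array.empty

    def fit(data: Mat, labels: Vec): this.type = {
      val numClasses = labels.max.toInt + 1
      centroids = Array.tabulate(numClasses) { c =>
        val members = data.zip(labels).collect { case (row, l) if l.toInt == c => row }
        members.transpose.map(col => col.sum / col.length)
      }
      this
    }

    // Score per class = negative squared distance to that class centroid.
    def transform(data: Mat): Mat =
      data.map { row =>
        centroids.map(c => -row.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum)
      }
  }

  def main(args: Array[String]): Unit = {
    val trainingData = Array(Array(0.0, 0.0), Array(0.0, 1.0), Array(10.0, 10.0), Array(10.0, 11.0))
    val labels = Array(0.0, 0.0, 1.0, 1.0)
    val model1 = new NearestCentroid().fit(trainingData, labels)
    println(model1.predict(Array(Array(1.0, 1.0), Array(9.0, 9.0))).mkString(", "))
  }
}
```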

This paradigm also lends itself nicely to pipelines:

pipeline1 = new Pipeline
.add( transformer1 )
.add( transformer2 )
.add( model1 )

pipeline1.fit( trainingData )
pipeline1.predict( testingData )
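
A rough sketch of how such a pipeline could chain stages is below. Again, the names (`Stage`, `Center`, `MaxAbsScale`) are hypothetical and invented for this example, and to keep it short the pipeline only exposes `transform`; a real one would presumably also forward `predict` to its last stage.

```scala
// Hypothetical sketch: each stage is fitted on the output of the stages before it.
object PipelineSketch {
  type Mat = Array[Array[Double]] // one row per observation

  trait Stage {
    def fit(data: Mat): this.type
    def transform(data: Mat): Mat
  }

  // Immutable builder: add returns a new Pipeline with the stage appended.
  final class Pipeline(stages: Vector[Stage] = Vector.empty) {
    def add(stage: Stage): Pipeline = new Pipeline(stages :+ stage)

    // Fit stage by stage, feeding each stage the transformed training data.
    def fit(data: Mat): Pipeline = {
      stages.foldLeft(data)((current, stage) => stage.fit(current).transform(current))
      this
    }

    def transform(data: Mat): Mat =
      stages.foldLeft(data)((current, stage) => stage.transform(current))
  }

  // Subtracts the per-column mean learned during fit.
  final class Center extends Stage {
    private var means: Array[Double] = Array.empty
    def fit(data: Mat): this.type = {
      means = data.transpose.map(col => col.sum / col.length)
      this
    }
    def transform(data: Mat): Mat =
      data.map(row => row.zip(means).map { case (x, m) => x - m })
  }

  // Divides by the per-column maximum absolute value learned during fit.
  final class MaxAbsScale extends Stage {
    private var scales: Array[Double] = Array.empty
    def fit(data: Mat): this.type = {
      scales = data.transpose.map(col => math.max(col.map(v => math.abs(v)).max, 1e-12))
      this
    }
    def transform(data: Mat): Mat =
      data.map(row => row.zip(scales).map { case (x, s) => x / s })
  }

  def main(args: Array[String]): Unit = {
    val pipeline1 = new Pipeline().add(new Center).add(new MaxAbsScale)
    pipeline1.fit(Array(Array(0.0), Array(4.0)))
    println(pipeline1.transform(Array(Array(6.0))).map(_.mkString(", ")).mkString(" | "))
  }
}
```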

I have to read up on recommenders a bit more to figure out how those play in,
or if we need another class.

In addition to that, I think we would have an optimizers section that allows
for the various flavors of SGD, but also allows other types of optimizers
altogether.

Again, just moving the conversation forward a bit here.

Excited to get to work on this

Best,

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

Post by Sebastian
Hi Andrew,
I think this topic is broader than just defining a few traits. A popular
way of integrating ML algorithms is via the combination of dataframes and
pipelines, similar to what scikit-learn and SparkML are offering at the moment.
Maybe it could make sense to integrate with what they have instead of
starting our own efforts?
Best,
Sebastian

Hi All,
https://issues.apache.org/jira/browse/MAHOUT-1856
This is a discussion that has popped up several times over the last
couple of years. As we move towards building out our algorithm library, it
would be great to nail this down now.
Most importantly, we want to avoid being criticized as "a loose bag of
algorithms", as we've sometimes been in the past.
The main point is that it would be good to lay out common traits for
Classification, Clustering, and Optimization algorithms.
This is just a start. I created this issue a few months back, and
intentionally left off Recommender, because I was unsure if there were
common traits across them. By traits, I am referring to both the
literal meaning and, more specifically, actual Scala traits.
@pat, @tdunning, @ssc, could you give your thoughts on this?
As well, it would be good to add online flavors of different algorithm
classes into the mix.
@tdunning could you share some thoughts here?
Trevor Grant will be heading up this effort, and it would be great if we
all as a team could come up with abstract design plans for each class of
algorithm (as well as determine the current "classes of algorithms"), as
each of us has our own unique blend of specializations and could give our
thoughts on this.
Currently this is really the opening of the conversation.
https://issues.apache.org/jira/browse/MAHOUT-1856
Any feedback is welcomed.
Thanks,
Andy

Dmitriy Lyubimov

2016-07-21 18:43:36 UTC

sk-learn learner, transformer and predictor features sound good to me,
tried-and-proven

most importantly imo we need a strong established type system and not repeat
what i view as a problem in some other offerings. If the type system is
strict and limited in size, then there's much less need for data adapters,
or none at all.

so what we have:
-- double precision tensor types (but not n-d arrays though)
what we don't have:
-- data frames

What we may want to have
-- formula support, especially for non-linear glm ("linear generalized
linear", does this make sense at all?) ok, non-linear regressions.
formula normally acts on data-frame-y data, not on tensor data, albeit it
produces tensor data. Herein lies a conundrum. I don't see mahout taking on
data frames, this is just too big. but good formula and "factor" (in the R
sense) support is nice to have for down-to-earth problems.

perhaps a tactical solution here is to integrate some foreign engine's data
frames but keep formula support native to mahout. But i didn't give it much
thought, because, although formulas and step-wise non-linear model searches
are the first thing to happen to any analytics (but somehow it hasn't
happened well enough elsewhere), i don't see how it can be made cheaply in
an engine-agnostic way. I still commonly view mahout as an under-funded
project, so choices of new things should be smart -- small in volume, big
on bang. Dataframes are not small in volume, esp. since i am
increasingly turning away from Spark in my personal endeavors, so i won't
support just integrating sparkql for this purpose.

A big area that people actually need (IMO) and that hasn't been done well
elsewhere (IMO) is model and model parameter searches. This "ML optimizer"
idea has been in AMPLab for as long as i remember them, and is still
very popular, but I don't think there are good offerings that actually solve
this problem in OSS. One of the reasons: modern OSS is pretty slow for the
volume required by the task. if we get some unique improvements to the
framework, we can think of getting into this business. this shouldn't be that
difficult, assuming the throughput is not an issue. GPU clusters are
increasingly common, we can hope we'll get there in the future.

on algorithm side, i would love to see something with 2d inputs, cnns or
something, for image processing.


Trevor Grant

2016-07-21 19:35:01 UTC

+1

The sklearn paradigm I think is awesome as an API, but I'm not looking to
make sklearn for Spark. To Dmitriy's first point (correct me if I'm
extrapolating incorrectly), every underlying engine already has an SGD
regression, K-Means, and a couple of other standbys. They take no time to
build, but why? If the user wants them, they can use them in the native
engine (or we can slap them in there just because).

Let's (aim to) differentiate by providing useful algorithms not already
shipped standard in every other ML package on the block.

Another 'algorithm' that is used very widely in every industry I've been in
(marketing and CPG), and that doesn't have a pleasant 'Big Data' solution, is
hierarchical models (also called mixed models). There's a bunch of other
'daily drivers' that everyone already uses in R/SAS/etc. that just don't
scale well, thus the rise of SGD and Big Data algos. Mahout is the ML
library for people who actually know math IMHO, in contrast to others that
are ML for computer scientists. Let's expose some algorithms that single-node
analysts know and are comfortable with.

So OLS isn't as efficient as SGD... so what. An analyst can pick up
Mahout and migrate their old methods into a distributed environment.
Further, they can see t-scores and f-scores and chi-squared tests, all those
statistics that everyone has come to know and love. I think that would be a
huge win, as it erases this idea that if you're going to work in big data
you must abandon the old ways.

To Dmitriy's last point, the sklearn equivalent of that:
http://scikit-learn.org/stable/modules/grid_search.html

I agree 100%, it's something I truly miss about sklearn. I'd support
implementing those 'everyone has one' algos from paragraph 1 if that was
the end goal.
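
To make the parameter-search idea a little more concrete, here is a bare-bones exhaustive grid search. This is a sketch under my own assumptions, not sklearn's or Mahout's API: `grid` and `search` are hypothetical helper names, and the `score` function passed in stands in for "fit a model with these parameters and return its validation error".

```scala
// Hypothetical sketch: exhaustive search over a small parameter grid.
object GridSearchSketch {
  // Cartesian product of the named parameter values.
  def grid(params: Map[String, Seq[Double]]): Seq[Map[String, Double]] =
    params.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (combo <- acc; v <- values) yield combo + (name -> v)
    }

  // Returns the parameter combination with the lowest score, plus that score.
  def search(params: Map[String, Seq[Double]])(
      score: Map[String, Double] => Double): (Map[String, Double], Double) =
    grid(params).map(p => (p, score(p))).minBy(_._2)

  def main(args: Array[String]): Unit = {
    val params = Map("a" -> Seq(0.0, 1.0, 2.0, 3.0), "b" -> Seq(-1.0, 0.0, 1.0))
    // Stand-in for a real validation error: minimized at a = 2, b = -1.
    val (best, bestScore) = search(params) { p =>
      math.pow(p("a") - 2.0, 2) + math.pow(p("b") + 1.0, 2)
    }
    println(s"best = $best, score = $bestScore")
  }
}
```

The number of fits is the product of the grid sizes, which is where the throughput concern Dmitriy raised comes in.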

Finally, re: data frames. Why not leave it as vectors and matrices? That is
a more R-like thing to do anyway.

val X: Matrix = data
val y: Vector = labels

model1.fit(X, y)
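
As a toy illustration of that `fit(X, y)` shape combined with the "show the analyst their t-scores" point above, here is a single-predictor OLS sketch. Everything in it is an assumption made for the example: the `Vector` and `Matrix` aliases are plain arrays standing in for Mahout's types, `OlsFit` is a hypothetical result class, and only one predictor column plus an intercept is supported.

```scala
// Hypothetical sketch: closed-form OLS for y = intercept + slope * x.
object OlsSketch {
  type Vector = Array[Double] // stand-in, shadows scala's Vector inside this object
  type Matrix = Array[Vector] // one row per observation

  final case class OlsFit(intercept: Double, slope: Double, slopeStdErr: Double) {
    // t-score of the slope against the null hypothesis slope = 0.
    def tScore: Double = slope / slopeStdErr
  }

  def fit(X: Matrix, y: Vector): OlsFit = {
    require(X.length == y.length && X.length > 2, "need matching lengths and more than two rows")
    require(X.forall(_.length == 1), "this sketch supports one predictor column only")
    val x = X.map(_(0))
    val n = x.length
    val xBar = x.sum / n
    val yBar = y.sum / n
    val sxx = x.map(v => (v - xBar) * (v - xBar)).sum
    val sxy = x.zip(y).map { case (a, b) => (a - xBar) * (b - yBar) }.sum
    val slope = sxy / sxx
    val intercept = yBar - slope * xBar
    // Residual sum of squares, then the usual standard error with n - 2 degrees of freedom.
    val sse = x.zip(y).map { case (a, b) => val r = b - (intercept + slope * a); r * r }.sum
    val stdErr = math.sqrt(sse / (n - 2) / sxx)
    OlsFit(intercept, slope, stdErr)
  }

  def main(args: Array[String]): Unit = {
    val X: Matrix = Array(Array(1.0), Array(2.0), Array(3.0), Array(4.0), Array(5.0))
    val y: Vector = Array(2.0, 4.0, 5.0, 4.0, 5.0)
    val model1 = fit(X, y)
    println(f"intercept=${model1.intercept}%.3f slope=${model1.slope}%.3f t=${model1.tScore}%.3f")
  }
}
```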

I don't mean to dominate the conversation, and I'm sorry, but I really
wanted to toss that idea re: hierarchical models out there, because I know
lots of people who would love to have them, and it is the thing keeping them
on single-core machines at the moment.

tg



Dmitriy Lyubimov

2016-07-21 20:47:26 UTC