Traits for a Mahout algorithm library
2016-07-21 12:13:18 UTC
Hi Andrew,

I think this topic is broader than just defining a few traits. A popular
way of integrating ML algorithms is via the combination of dataframes
and pipelines, similar to what scipy and SparkML are offering at the
moment. Maybe it could make sense to integrate with what they have
instead of starting our own efforts?

Hi All,
I'd like to draw your attention to MAHOUT-1856: https://issues.apache.org/jira/browse/MAHOUT-1856
This is a discussion that has popped up several times over the last couple of years. As we move towards building out our algorithm library, it would be great to nail this down now.
Most importantly, we want to avoid being criticized as "a loose bag of algorithms", as we have sometimes been in the past.
The main point is that it would be good to lay out common traits for Classification, Clustering, and Optimization algorithms.
This is just a start. I created this issue a few months back, and intentionally left off Recommender, because I was unsure if there were common traits across them. By traits, I am referring both to the literal meaning and, more specifically, to actual Scala traits.
@pat, @tdunning, @ssc, could you give your thoughts on this?
It would also be good to add online flavors of different algorithm classes into the mix.
@tdunning could you share some thoughts here?
Trevor Grant will be heading up this effort, and it would be great if we as a team could come up with abstract design plans for each class of algorithm (as well as determine the current "classes of algorithms"), since each of us has our own unique blend of specializations and could give our thoughts on this.
Currently this is really the opening of the conversation.
It would be best to post thoughts on: https://issues.apache.org/jira/browse/MAHOUT-1856
Any feedback is welcome.
Trevor Grant
2016-07-21 15:08:18 UTC
I was thinking so too. Most ML frameworks are at least loosely based on the
sklearn paradigm. For those not familiar, at a very abstract level:

model1 = new Algo // e.g. K-Means, Random Forest, Neural Net


// then, depending on the goal of the algorithm, you have either (or both)
preds = model1.predict( testData ) // returns a vector of predictions, one
                                   // for each observation in the testing data

// or sometimes
newVals = model1.transform( testData ) // returns a new dataset-like object,
// which makes more sense for things like neural nets, or when you're not
// just predicting a single value per observation

In addition to the above, pre-processing operations then also have a
transform method such as

preprocess1 = new Normalizer

preprocess1.fit( trainingData ) // in this phase it calculates the mean and
                                // variance of the training data set

preprocessedTrainingData = preprocess1.transform( trainingData )
preprocessedTestingData = preprocess1.transform( testingData )
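
To make the fit/transform split concrete, here is a minimal Scala sketch of
such a normalizer. The `Normalizer` class, the `Dataset` alias, and the use of
plain arrays are assumptions made for illustration; none of this is an
existing Mahout API.

```scala
// Hypothetical sketch: names and types are illustrative, not Mahout API.
object NormalizerSketch {
  type Dataset = Seq[Array[Double]] // one Array per observation

  class Normalizer {
    private var mean: Array[Double] = Array.empty
    private var std: Array[Double] = Array.empty

    // Learn per-column mean and standard deviation from the training data.
    def fit(data: Dataset): Normalizer = {
      val n = data.size.toDouble
      val cols = data.head.length
      mean = Array.tabulate(cols)(j => data.map(_(j)).sum / n)
      std = Array.tabulate(cols) { j =>
        val variance = data.map(r => math.pow(r(j) - mean(j), 2)).sum / n
        if (variance == 0.0) 1.0 else math.sqrt(variance) // constant column
      }
      this
    }

    // Apply the statistics learned in fit() to any dataset.
    def transform(data: Dataset): Dataset =
      data.map(row => Array.tabulate(row.length)(j => (row(j) - mean(j)) / std(j)))
  }
}
```

The point of the split is that the testing data is transformed with the
statistics learned from the training data, never with its own.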

I think this is a reasonable approach because A) it makes sense and B) it is
a standard of sorts across ML libraries (because of A).

We have two high level bucket types, based on what the output is:

Predictors and Transformers

Predictors: anything that returns a single value per observation, i.e.
classifiers and regressors

Transformers: anything that returns a vector per observation
- Pre-processing operations
- Classifiers, in that usually there is a probability vector for each
observation as to which class it belongs to; the 'predict' method then
just picks the most likely class
- Neural nets ( though with one small tweak can be extended to regression
or classification )
- Any unsupervised learning application (e.g. clustering)
- etc.

And so really we have something like:

class LearningFunction
  def fit()

class Transformer extends LearningFunction
  def transform()

class Predictor extends Transformer
  def predict()
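
As a rough illustration, the hierarchy above could be written as Scala traits
along the following lines. This is only a sketch under assumptions: the
supervised `fit(data, labels)` signature, the `Vec`/`Dataset` aliases, and the
toy `NearestCentroid` class are all hypothetical and are not part of Mahout.

```scala
// Hypothetical sketch of the hierarchy; signatures are assumptions.
object TraitSketch {
  type Vec = Array[Double]
  type Dataset = Seq[Vec]

  // Anything that learns state from data.
  trait LearningFunction {
    def fit(data: Dataset, labels: Seq[Int]): Unit
  }

  // Returns a vector per observation (e.g. one score per class).
  trait Transformer extends LearningFunction {
    def transform(data: Dataset): Dataset
  }

  // Returns a single value per observation: by default, the index of the
  // largest score produced by transform().
  trait Predictor extends Transformer {
    def predict(data: Dataset): Seq[Int] =
      transform(data).map(v => v.indexOf(v.max))
  }

  // Toy classifier; assumes labels are 0 until k so that the position of a
  // centroid in the sequence equals its class label.
  class NearestCentroid extends Predictor {
    private var centroids: Seq[Vec] = Seq.empty

    def fit(data: Dataset, labels: Seq[Int]): Unit = {
      val byClass = data.zip(labels).groupBy(_._2)
      centroids = byClass.keys.toSeq.sorted.map { c =>
        val rows = byClass(c).map(_._1)
        Array.tabulate(rows.head.length)(j => rows.map(_(j)).sum / rows.size)
      }
    }

    // Score per class: negative Euclidean distance to that class's centroid.
    def transform(data: Dataset): Dataset =
      data.map(x => centroids.map(c => -distance(x, c)).toArray)

    private def distance(a: Vec, b: Vec): Double =
      math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)
  }
}
```

Here `predict` comes for free once `transform` yields per-class scores, which
mirrors the remark that a classifier's predict just picks the most likely
class.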

This paradigm also lends itself nicely to pipelines...

pipeline1 = new Pipeline
.add( transformer1 )
.add( transformer2 )
.add( model1 )

pipeline1.fit( trainingData )
pipeline1.predict( testingData )
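
A minimal sketch of how such a pipeline might chain its stages is shown below.
It assumes every stage exposes `fit` and `transform`, and it exposes the final
result through `transform` rather than `predict` to keep the example short.
The `Pipeline`, `Center`, and `Scale` classes are hypothetical, not existing
Mahout code.

```scala
// Hypothetical sketch: a pipeline that fits and applies stages in order.
object PipelineSketch {
  type Dataset = Seq[Array[Double]]

  trait Transformer {
    def fit(data: Dataset): Unit
    def transform(data: Dataset): Dataset
  }

  // Toy stage: subtracts the per-column mean seen during fit().
  class Center extends Transformer {
    private var mean: Array[Double] = Array.empty
    def fit(data: Dataset): Unit =
      mean = Array.tabulate(data.head.length)(j => data.map(_(j)).sum / data.size)
    def transform(data: Dataset): Dataset =
      data.map(row => row.zip(mean).map { case (x, m) => x - m })
  }

  // Toy stage with nothing to learn: multiplies every value by a constant.
  class Scale(factor: Double) extends Transformer {
    def fit(data: Dataset): Unit = ()
    def transform(data: Dataset): Dataset = data.map(_.map(_ * factor))
  }

  class Pipeline {
    private val stages = scala.collection.mutable.ArrayBuffer.empty[Transformer]

    def add(stage: Transformer): Pipeline = { stages += stage; this }

    // Fit each stage on the output of the stages before it.
    def fit(data: Dataset): Pipeline = {
      stages.foldLeft(data) { (current, stage) =>
        stage.fit(current)
        stage.transform(current)
      }
      this
    }

    // Push new data through every fitted stage in order.
    def transform(data: Dataset): Dataset =
      stages.foldLeft(data)((current, stage) => stage.transform(current))
  }
}
```

The key design choice is that `fit` feeds each stage the already-transformed
output of the previous one, so later stages learn from the same
representation they will see at prediction time.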

I have to read up on recommenders a bit more to figure out how those play in,
or if we need another class.

In addition to that, I think we would have an optimizers section that allows
for the various flavors of SGD, but also allows other types of optimizers
altogether.

Again, just moving the conversation forward a bit here.

Excited to get to work on this



Trevor Grant
Data Scientist

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Dmitriy Lyubimov
2016-07-21 18:43:36 UTC
The sk-learn learner, transformer and predictor features sound good to me.

Most importantly, imo, we need a strong, established type system so that we
do not repeat what I view as a problem in some other offerings. If the type
system is strict and limited in size, then there is much less need for data
adapters, or none at all.

So, what we have:
-- double-precision tensor types (but not n-d arrays, though)
What we don't have:
-- data frames

What we may want to have:
-- formula support, especially for non-linear GLM ("linear generalized
linear", does this make sense at all?), ok, non-linear regressions.
A formula normally acts on data-frame-like data, not on tensor data, although
it produces tensor data. Herein lies a conundrum. I don't see Mahout taking
on data frames; this is just too big. But good formula and "factor" (in the R
sense) support is nice to have for down-to-earth problems.

Perhaps a tactical solution here is to integrate some foreign engine's data
frames but provide Mahout-native formula support. But I didn't give it much
thought because, although formulas and step-wise non-linear model searches
are the first thing to happen to any analytics (but somehow it hasn't
happened well enough elsewhere), I don't see how it can be done cheaply in an
engine-agnostic way. I still commonly view Mahout as an under-funded
project, so choices of new things should be smart: small in volume, great
in bang. Data frames are not small in volume, and since I am increasingly
turning away from Spark in my personal endeavors, I won't support just
integrating sparkql for this purpose.

A big area that people actually need (IMO), and that hasn't been done well
elsewhere (IMO), is model and model parameter search. This "ML optimizer"
idea has been around AMPLab for as long as I can remember and is still very
popular, but I don't think there are good offerings in OSS that actually
solve this problem. One of the reasons is that modern OSS is pretty slow for
the volume required by the task. If we get some unique improvements to the
framework, we can think about getting into this business. It shouldn't be
that difficult, assuming throughput is not an issue. GPU clusters are
increasingly common, so we can hope we'll get there in the future.

On the algorithm side, I would love to see something with 2-D inputs, CNNs or
something, for image processing.
Trevor Grant
2016-07-21 19:35:01 UTC

The sklearn paradigm is, I think, awesome as an API, but I'm not looking to
make sklearn for Spark. To Dmitriy's first point (correct me if I'm
extrapolating incorrectly), every underlying engine already has an SGD
regression, K-Means, and a couple of other standbys. They take no time to
build, but why build them? If users want them, they can use them in the
native engine (or we can slap them in there just because).

Let's (aim to) differentiate by providing useful algorithms not already
shipped standard in every other ML package on the block.

Another 'algorithm' that is used very widely in every industry I've been in
(Marketing and CPG), and that doesn't have a pleasant 'Big Data' solution, is
hierarchical models (also called mixed models). There are a bunch of other
'daily drivers' that everyone already uses in R/SAS/etc. that just don't
scale well, thus the rise of SGD and Big Data algos. Mahout is the ML
library for people who actually know math, IMHO, in contrast to others that
are ML for computer scientists. Let's expose some algorithms that
single-node analysts know and are comfortable with.

So OLS isn't as efficient as SGD... so what? An analyst can pick up
Mahout and migrate their old methods into a distributed environment.
Further, they can see t-scores, F-scores, chi-squared tests, and all those
statistics that everyone has come to know and love. I think that would be a
huge win, as it erases this idea that if you're going to work in big data
you must abandon the old ways.
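
For readers who have not seen the closed form, here is a small sketch of the
kind of output meant here: single-predictor OLS that reports a t-score for the
slope. It runs on plain in-memory arrays, not on distributed matrices, and the
`OlsSketch` object and `Fit` case class are names invented for this example.

```scala
// Hypothetical sketch: single-predictor OLS with a slope t-score.
object OlsSketch {
  final case class Fit(intercept: Double, slope: Double, slopeStdErr: Double) {
    def tScore: Double = slope / slopeStdErr
  }

  def fit(x: Array[Double], y: Array[Double]): Fit = {
    require(x.length == y.length && x.length > 2, "need more than two paired observations")
    val n = x.length
    val xBar = x.sum / n
    val yBar = y.sum / n
    val sxx = x.map(v => (v - xBar) * (v - xBar)).sum
    val sxy = x.zip(y).map { case (a, b) => (a - xBar) * (b - yBar) }.sum
    val slope = sxy / sxx
    val intercept = yBar - slope * xBar
    // Residual sum of squares and residual variance (n - 2 degrees of freedom).
    val sse = x.zip(y).map { case (a, b) => math.pow(b - (intercept + slope * a), 2) }.sum
    val sigma2 = sse / (n - 2)
    Fit(intercept, slope, math.sqrt(sigma2 / sxx))
  }
}
```

Calling `OlsSketch.fit(x, y).tScore` gives the familiar slope statistic that
an SGD-based fit does not hand back by itself.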

To Dmitriy's last point, the sklearn equivalent of that:

I agree 100%, it's something I truly miss about sklearn. I'd support
implementing those 'everyone has one' algos from paragraph 1 if that was
the end goal.

Finally, re data frames: why not leave it as vectors and matrices? That is
a more R-like thing to do anyway.

val X: Matrix = data
val y: Vector = labels

model1.fit(X, y)

I don't mean to dominate the conversation, and I'm sorry, but I really
wanted to toss that idea re: hierarchical models out there, because I know
lots of people who would love to have them, and it is the thing keeping them
on single-core machines at the moment.


Dmitriy Lyubimov
2016-07-21 20:47:26 UTC
Post by Trevor Grant
Finally, re data-frames. Why not leave it as vectors and matrices?
Short answer: because (imo) data frames are not vectors and matrices.

Longer argument:

Some capabilities expected of data frames are as follows.

DFs are columnar tables where columns are either named vectors or named
factors (in R sense).

Also, operationally, DFs usually lean more toward providing relational
algebra capabilities (joins etc.) than numerical algebra (BLAS3).

A factor (or, perhaps a better term, a categorical feature) is
fundamentally non-numerical data. It is a representation of categorical
data that could be bounded or unbounded in the number of categories.

Furthermore, there is more than one way to vectorize a factor or a group
of factors, which is what formulas and other things are called on to do.
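
As a small illustration of "more than one way", the sketch below vectorizes
the same factor three ways: one-hot encoding, treatment coding (dropping the
first level as the baseline, as R's default contrasts do), and the hashing
trick. The `FactorSketch` object and its function names are made up for this
example and are not an existing API.

```scala
// Hypothetical sketch: three ways to turn one factor value into numbers.
object FactorSketch {
  // One-hot: one column per known level; needs the full level set up front.
  def oneHot(levels: Seq[String])(value: String): Array[Double] =
    levels.map(l => if (l == value) 1.0 else 0.0).toArray

  // Treatment coding: like one-hot, but the first level is the baseline
  // and gets no column of its own.
  def treatment(levels: Seq[String])(value: String): Array[Double] =
    oneHot(levels)(value).drop(1)

  // Hashing trick: fixed width even for an unbounded number of levels,
  // at the cost of possible collisions between levels.
  def hashed(width: Int)(value: String): Array[Double] = {
    val v = Array.fill(width)(0.0)
    v(Math.floorMod(value.hashCode, width)) = 1.0
    v
  }
}
```

Which encoding is used, and with which width or baseline, is itself a
parameter of the model, which is why the choice cannot always be pushed
outside the fitting process.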

Now you might view all these formulas, factors and hash tricks as a feature
preparation activity and say that the learning process is not bothered by
that. In the end, every fit essentially works on numerical input.

Unfortunately, that may not be quite true.

Model search (step-wise GLM, for example) is not necessarily a
numerical-only thing, since it essentially manages factor vectorization.

That said, I think we can safely say that an individual learner could be a
numerical-only thing. But as soon as we go up the chain to transformations,
vectorizations and searching for parameters of vectorizations, data frames
are usually the input sources for all of those.

An excellent example of this (one that another OSS project failed to
architect with a proper separation of concerns) is the implicit feedback
recommender.

In fact, there are two problems here: one is parameterized feature
extraction and the other is fitting the decomposition.

Each of the problems has its own parameters. In the vanilla paper
implementation, two ways of feature extraction were suggested, each offering
one parameter, and those parameters were meant to be searched for via CV
along with the fitter hyperparameters (learning rate, regularization).

What this means is that a hyperparameter search may span feature extraction
_and_ fitting, and so may essentially require a data frame as input in the
most general case (and I have run into such a practical case before).
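
A generic sketch of such a joint search follows. It assumes a plain
exhaustive grid and a caller-supplied `score` function where higher is
better (for instance a cross-validated metric); `gridSearch` and all of its
type parameters are hypothetical names, with `Raw` standing in for the
data-frame-like input and `Feat` for the vectorized features.

```scala
// Hypothetical sketch: one search loop spanning extraction and fitting.
object SearchSketch {
  def gridSearch[Raw, Feat, Model, E, F](
      raw: Raw,                        // un-vectorized, data-frame-like input
      extractParams: Seq[E],           // candidate feature-extraction settings
      fitParams: Seq[F],               // candidate fitter hyperparameters
      extract: (Raw, E) => Feat,
      fit: (Feat, F) => Model,
      score: (Model, Feat) => Double   // higher is better
  ): (E, F, Double) = {
    val candidates = for {
      e <- extractParams
      feats = extract(raw, e)          // re-vectorize for every extraction setting
      f <- fitParams
    } yield (e, f, score(fit(feats, f), feats))
    candidates.maxBy(_._3)             // throws if either grid is empty
  }
}
```

Because `extract` is re-run inside the loop, the search has to start from the
raw input rather than from a matrix that was vectorized once up front.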

Finally, some goodness-of-fit metrics work on pre-vectorized factors.

This is all standard, but unfortunately it is all pretty expensive to do. I
have a big problem discarding the notion of data frame support as part of
the fitting/search process for some areas of computational statistics.