Discussion:
[jira] [Created] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms
Andrew Palumbo (JIRA)
2016-05-09 22:19:12 UTC
Permalink
Andrew Palumbo created MAHOUT-1856:
--------------------------------------

Summary: Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms
Key: MAHOUT-1856
URL: https://issues.apache.org/jira/browse/MAHOUT-1856
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Andrew Palumbo
Fix For: 0.13.0


To ensure that Mahout does not become "a loose bag of algorithms", create basic traits with functions common to each class of algorithm.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Andrew Palumbo (JIRA)
2016-05-23 20:45:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1856:
-----------------------------------
Priority: Critical (was: Major)
Andrew Palumbo (JIRA)
2016-05-23 20:48:12 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1856:
-----------------------------------
Affects Version/s: (was: 0.12.0)
0.12.1
Trevor Grant (JIRA)
2016-07-20 22:42:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trevor Grant reassigned MAHOUT-1856:
------------------------------------

Assignee: Trevor Grant
ASF GitHub Bot (JIRA)
2016-08-04 15:51:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407985#comment-15407985 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

GitHub user rawkintrevo opened a pull request:

https://github.com/apache/mahout/pull/246

[MAHOUT-1856][WIP] Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

Relevant JIRA: [https://issues.apache.org/jira/browse/MAHOUT-1856](https://issues.apache.org/jira/browse/MAHOUT-1856)

Readme.md provides a more comprehensive (yet still incomplete) overview.

Key Points:
Top Level Class:
Model has one method, fit, plus coefs.

Transformers map a vector input to a vector output (same or different length)
Regressors map a vector input to a single output (e.g. a Double)
Classifiers extend Transformers that produce a probability vector; they 'select' the class and return its label (instead of the entire p-vector)

Pipelines and Ensembles are models as well, except they are composed from other models listed above, or from other pipelines and ensembles.
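
To make the hierarchy above concrete, here is a minimal sketch in plain Scala. It is not the code from the pull request: `Vec`, `MeanCenter`, `Pipeline` and every method name other than `fit` and `coefs` are assumptions made for illustration, and plain Scala collections stand in for Mahout vectors and DRMs. Since supervised and unsupervised models need different `fit` arguments, the sketch declares `fit` on the sub-traits, which is also an assumption.

```scala
// Illustrative only: plain Scala collections stand in for Mahout vectors and DRMs.
object FrameworkSketch {
  type Vec = Vector[Double]

  // Top-level abstraction: a fitted object that exposes its coefficients.
  trait Model {
    def coefs: Map[String, Vec]
  }

  // Vector in, vector out (same or different length).
  trait Transformer extends Model {
    def fit(data: Seq[Vec]): Unit
    def transform(x: Vec): Vec
  }

  // Vector in, single number out.
  trait Regressor extends Model {
    def fit(x: Seq[Vec], y: Seq[Double]): Unit
    def predict(x: Vec): Double
  }

  // A transformer that emits a probability vector, plus label selection.
  trait Classifier extends Transformer {
    def labels: Vector[String]
    def classify(x: Vec): String = {
      val p = transform(x)
      labels(p.indexOf(p.max)) // label of the most probable class
    }
  }

  // Example transformer: subtracts the column means learned in fit.
  class MeanCenter extends Transformer {
    private var means: Vec = Vector.empty
    def fit(data: Seq[Vec]): Unit =
      means = data.transpose.map(col => col.sum / col.size).toVector
    def coefs: Map[String, Vec] = Map("means" -> means)
    def transform(x: Vec): Vec = x.zip(means).map { case (v, m) => v - m }
  }

  // A pipeline is itself a transformer built from other transformers.
  class Pipeline(stages: Seq[Transformer]) extends Transformer {
    def fit(data: Seq[Vec]): Unit = {
      var current = data
      for (stage <- stages) {
        stage.fit(current)                      // fit on the output of earlier stages
        current = current.map(stage.transform)
      }
    }
    def coefs: Map[String, Vec] =
      stages.zipWithIndex.flatMap { case (s, i) =>
        s.coefs.map { case (k, v) => s"stage$i.$k" -> v }
      }.toMap
    def transform(x: Vec): Vec = stages.foldLeft(x)((v, s) => s.transform(v))
  }
}
```

Because `Pipeline` extends `Transformer`, a pipeline can be used as a stage of another pipeline, which is the composition property described above.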

ToDo:
- [ ] All models need a uniform way to expose their tuning parameters -> this will be required for an auto-tuning algo.
- [ ] Pipelines / Ensembles must be able to account for and report the tunable parameters of their sub-models
- [ ] Need fitness functions
- [ ] Native method wrappers: underlying engines and third-party packages already implement many ML models, so let's not reinvent the wheel by exposing YET ANOTHER SGD algorithm. Instead, we should be able to convert a matrix to the format the 'other' library expects, run the model, get the results, package them back into a matrix, and pass that on in the pipeline or ensemble. (This is especially useful for DeepLearning4J integration.) Also, native implementations of some algos on an engine are probably more efficient than anything we would write, because they can leverage engine-specific tricks (think Flink delta iterations).
- [ ] Lots more, open for discussion.
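
The first two ToDo items (a uniform way to expose tuning parameters, and composites that report the parameters of their sub-models) could look roughly like the sketch below. Nothing here comes from the pull request: the `Tunable` trait, the `Ridge` and `Composite` classes, and the dotted naming scheme are all hypothetical, chosen only to show one way an auto-tuner could see every knob through a single interface.

```scala
// Hypothetical names throughout; a sketch of the idea, not a Mahout API.
object TuningSketch {
  // Every model lists its tunable hyperparameters by name.
  trait Tunable {
    def tunableParams: Map[String, Double]
    def setParam(name: String, value: Double): Unit
  }

  // A leaf model with a single knob.
  class Ridge(var lambda: Double = 1.0) extends Tunable {
    def tunableParams: Map[String, Double] = Map("lambda" -> lambda)
    def setParam(name: String, value: Double): Unit = name match {
      case "lambda" => lambda = value
      case other    => throw new IllegalArgumentException(s"unknown parameter: $other")
    }
  }

  // A pipeline or ensemble reports its parts' parameters under a prefix.
  class Composite(parts: Seq[(String, Tunable)]) extends Tunable {
    def tunableParams: Map[String, Double] =
      parts.flatMap { case (prefix, m) =>
        m.tunableParams.map { case (k, v) => s"$prefix.$k" -> v }
      }.toMap

    def setParam(name: String, value: Double): Unit = {
      // Split "part.rest" at the first dot and delegate "rest" to that part.
      val (prefix, rest) = name.span(_ != '.')
      parts.find(_._1 == prefix) match {
        case Some((_, m)) if rest.nonEmpty => m.setParam(rest.drop(1), value)
        case _ => throw new IllegalArgumentException(s"unknown parameter: $name")
      }
    }
  }
}
```

Since `Composite` is itself `Tunable`, nested pipelines and ensembles get fully qualified names such as `outer.inner.lambda` without extra code.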

This is merely a conversation starter on what to do.

I've included OLS as an example regressor and a normalizer as an example transformer, for illustrative purposes only. I really don't want to pack too many algos into this initial commit, just an example / proof of concept so we can say, yeah, this framework makes sense for this kind of model, OR ooh, we probably want to have these features too.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rawkintrevo/mahout mahout-1856

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246

----
commit 6c0f6bd322a50341bcc587750146467f9ff3fa0a
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-01T00:08:16Z

[MAHOUT-1856] ML Algo Framework

commit 1f04cd5436df12ded23b8a1815b93ce73ea2a32a
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-02T17:22:48Z

Building framework

commit 33b90c9795bbb1ff381a98045b0d5f2b641693a9
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-02T23:09:30Z

add placeholders for ensemble pipeline and fitness test

commit 83c6068e2aa18a62f6ae8b84169a018f764ab408
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T14:54:32Z

added readme

commit 52e9c3e1df4db1397ab81bf07c0e191cfd229b1a
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T14:58:59Z

fixed readme image

commit 92ceeb9603ff9c4927214b896c4dbcfc63f8c7c4
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T15:04:11Z

fixed readme image

commit c0b0464f45470375d709ef9475d474440411879f
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T15:04:52Z

fixed readme image

commit 6f0228aa7ff349cd8ff5c10a4dafe55ec2037ee4
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-04T15:36:53Z

removed autogen comments from files

commit 065fb24068e5e98b24f4f53ab8cb312abfb8b9ed
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-01T00:08:16Z

[MAHOUT-1856] ML Algo Framework

commit 127d5dec29ac8b7d6ad3a12c494d4ccdae24cd31
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-02T17:22:48Z

Building framework

commit 557af2ee7bec17b176c6def768ea6d3da8495b42
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-02T23:09:30Z

add placeholders for ensemble pipeline and fitness test

commit bde4c940f3e540ffb2e8eceb87355638ca157f89
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T14:54:32Z

added readme

commit 565a164082b3c00294db2a4bd1a0b001d561d6f9
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T14:58:59Z

fixed readme image

commit 950027c047021c23f44af64b842bcbc1bbd717f9
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T15:04:11Z

fixed readme image

commit 045192146e290d9762f09e4235dd4c2f947891d4
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-03T15:04:52Z

fixed readme image

commit f65d7a941f666d0a58d56ac642558dd15fb57cd7
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-04T15:36:53Z

removed autogen comments from files

commit 842db7ec3c21e5a4d1d152f1150b0dc97e5f44e7
Author: rawkintrevo <***@gmail.com>
Date: 2016-08-04T15:38:19Z

Merge branch 'mahout-1856' of https://github.com/rawkintrevo/mahout into mahout-1856

----
ASF GitHub Bot (JIRA)
2016-08-04 22:23:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408589#comment-15408589 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
ASF GitHub Bot (JIRA)
2016-08-05 01:11:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408731#comment-15408731 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246

Nice work on the slides also! :100:

Another thought that I had is that we may want to allow in-core matrices as parameters. Just throwing it out there for discussion. I can't think of a particular use case off the top of my head, but it seems that there should be one.
ASF GitHub Bot (JIRA)
2016-11-30 17:26:59 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709182#comment-15709182 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on the issue:

https://github.com/apache/mahout/pull/246

This is WIP so it doesn't really matter that it's failing atm, but [this](https://travis-ci.org/apache/mahout/builds/180124934#L176) isn't good. The `wget` command couldn't find Maven.
Dmitriy Lyubimov (JIRA)
2016-12-21 23:53:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768521#comment-15768521 ]

Dmitriy Lyubimov commented on MAHOUT-1856:
------------------------------------------

One thing: we usually squash working branches before moving a PR to master so that we preferably have one commit per issue. This is much easier to manage (and to hot-fix if needed later).
Andrew Palumbo (JIRA)
2016-12-20 18:51:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1856:
-----------------------------------
Sprint: Jan/Feb-2017
Andrew Palumbo (JIRA)
2016-12-31 22:03:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo reassigned MAHOUT-1856:
--------------------------------------

Assignee: Andrew Palumbo (was: Trevor Grant)

Just setting to in-progress. Will assign back to you [~rawkintrevo].
Andrew Palumbo (JIRA)
2016-12-31 22:03:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-1856 started by Andrew Palumbo.
----------------------------------------------
Andrew Palumbo (JIRA)
2016-12-31 22:03:58 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1856:
-----------------------------------
Assignee: Trevor Grant (was: Andrew Palumbo)
Andrew Palumbo (JIRA)
2017-01-16 02:28:28 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1856:
-----------------------------------
Sprint: Jan/Feb-2016 (was: Jan/Feb-2017)
Andrew Palumbo (JIRA)
2017-01-16 02:28:29 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1856:
-----------------------------------
Sprint: Jan/Feb-2017 (was: Jan/Feb-2016)
ASF GitHub Bot (JIRA)
2017-01-16 04:35:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823453#comment-15823453 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246

@dlyubimov, @mahout-team could you review/provide feedback on this? Originally Trevor had a separate module for this, and I asked him to move it into math-scala.
ASF GitHub Bot (JIRA)
2017-01-16 22:38:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824680#comment-15824680 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
ASF GitHub Bot (JIRA)
2017-01-19 22:33:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830724#comment-15830724 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96473124

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+abstract class Regressor extends Model {
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Also, maybe drmT for the target?
ASF GitHub Bot (JIRA)
2017-01-19 22:33:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830726#comment-15830726 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96468873

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcutt.scala ---
@@ -0,0 +1,68 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+class CochraneOrcutt extends Regressor {
+ // https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation
+
+ var regressor : Regressor = new OLS() // type of regression to do- must have a 'beta' fit param
+ var iterations = 3 // Number of iterations to run
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]) = {
+
+ regressor.fit(drmY, drmX)
+ fitParams("beta0") = regressor.fitParams("beta")
+
+ val Y = drmY(1 until drmY.nrow.toInt, 0 until 1).checkpoint()
--- End diff --

Consider giving the caller an option to specify a cache hint here, since this algorithm seems to rely on plenty of things being held in memory. Right now this implies memory only, so if the data doesn't fit, the algorithm will slow to a crawl. (In all fairness, in Spark it would slow to a crawl with a memory-and-disk spec too, but to put things in perspective, we are probably talking about the difference between a crawling snail and a snail skeleton 20 years after its death.)
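
One way to read this suggestion is to make the cache hint a constructor argument with a memory-only default, so existing call sites keep their behaviour. The sketch below is self-contained and deliberately avoids Mahout imports: the `CacheHint` enumeration, the `Matrix` case class and `IterativeRegressor` are local stand-ins invented for illustration, not the types used in the pull request.

```scala
// Stand-in types only; this does not call into Mahout.
object CacheHintSketch {
  // Local enumeration standing in for an engine's cache-hint type.
  object CacheHint extends Enumeration {
    val MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY = Value
  }

  // Stand-in for a distributed matrix that records how it was checkpointed.
  final case class Matrix(rows: Vector[Vector[Double]],
                          hint: Option[CacheHint.Value] = None) {
    def checkpoint(h: CacheHint.Value): Matrix = copy(hint = Some(h))
  }

  // The caller chooses the hint; the default keeps the memory-only behaviour.
  class IterativeRegressor(val cacheHint: CacheHint.Value = CacheHint.MEMORY_ONLY) {
    // Mirrors the diff line above: drop the first row, then checkpoint.
    def prepare(y: Matrix): Matrix =
      Matrix(y.rows.drop(1)).checkpoint(cacheHint)
  }
}
```

A caller who expects the data to exceed memory would then construct the regressor with `CacheHint.MEMORY_AND_DISK` instead of editing the algorithm.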
ASF GitHub Bot (JIRA)
2017-01-19 22:33:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830725#comment-15830725 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96473064

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+abstract class Regressor extends Model {
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Perhaps making the first parameter the predictors and the second the target is a more intuitive signature for most.
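
As a rough illustration of the proposed ordering, the sketch below puts the predictors first and the target second, using the `drmX` and `drmT` names from the review comments. `Rows` is a hypothetical stand-in for `DrmLike` and `MeanRegressor` is a toy model, so this is only a shape for the signature, not the pull request's code.

```scala
// Stand-in types; Rows[K] plays the role of a row-keyed distributed matrix.
object SignatureSketch {
  final case class Rows[K](data: Seq[(K, Vector[Double])])

  abstract class Regressor {
    // K is an ordinary type parameter for the row key. Naming it `Int`, as in
    // the diff, would declare a type parameter that shadows scala.Int inside
    // the method rather than fixing the key type to integers.
    def fit[K](drmX: Rows[K], drmT: Rows[K]): Unit
  }

  // Toy regressor: ignores the predictors and learns the mean of the target.
  class MeanRegressor extends Regressor {
    var mean: Double = Double.NaN
    def fit[K](drmX: Rows[K], drmT: Rows[K]): Unit = {
      require(drmT.data.nonEmpty, "target must not be empty")
      require(drmX.data.size == drmT.data.size, "X and T need the same number of rows")
      val t = drmT.data.map(_._2.head) // first (only) column of the target
      mean = t.sum / t.size
    }
  }
}
```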
ASF GitHub Bot (JIRA)
2017-01-19 22:33:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830723#comment-15830723 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96467982

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/Model.scala ---
@@ -0,0 +1,36 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms
+
+import org.apache.mahout.math.{Vector => MahoutVector, drm}
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.scalabindings._
+
+import scala.reflect.ClassTag
+
+abstract class Model extends Serializable {
+
+ var fitParams = collection.mutable.Map[String, MahoutVector]()
--- End diff --

So the model requires all parameters to be named and to be vectors? Shouldn't this be an artifact of more specialized models, like, say, GLMs? There are plenty of ML models that would probably not fit that fairly rigid definition, not easily or pragmatically, at least.
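
To illustrate the concern, the sketch below keeps the base `Model` free of any assumption about parameter shape and moves the named-vector map into a more specialized trait. All names (`HasNamedVectors`, `LinearModel`, `TreeModel`, and the `Node` types) are hypothetical. The decision tree is included as an example of a model whose fitted state is a recursive structure and would not sit naturally in a `Map[String, Vector]`.

```scala
// Hypothetical names; a sketch of one alternative, not the PR's design.
object ParamsSketch {
  // The base type says nothing about how parameters are stored.
  trait Model

  // GLM-like models opt in to named coefficient vectors.
  trait HasNamedVectors extends Model {
    def fitParams: Map[String, Vector[Double]]
  }

  final case class LinearModel(beta: Vector[Double]) extends HasNamedVectors {
    def fitParams: Map[String, Vector[Double]] = Map("beta" -> beta)
  }

  // A tree's fitted state is a recursive structure, not a set of vectors.
  sealed trait Node
  final case class Leaf(value: Double) extends Node
  final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  final case class TreeModel(root: Node) extends Model {
    def predict(x: Vector[Double]): Double = {
      // Walk down the tree until a leaf is reached.
      def go(n: Node): Double = n match {
        case Leaf(v)           => v
        case Split(f, t, l, r) => go(if (x(f) <= t) l else r)
      }
      go(root)
    }
  }
}
```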
ASF GitHub Bot (JIRA)
2017-01-19 22:43:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830738#comment-15830738 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96976147

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+abstract class Regressor extends Model {
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Also: I am not sure a fitter should extend a model. Rather, a fitter should return a model, i.e.,
`fit[k](...) : Model`, right? I think that's how this pattern goes in most kits. A fitter is just a strategy.

And I'd abstain from using abstract classes in Scala unless a trait absolutely cannot do the job (and it can in this case). An abstract class points to a specific, single, and necessary base implementation in the hierarchy, which constrains the actual implementations without need.
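The pattern described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pull request's code: `FitterSketch`, `Fitter` and `MeanFitter` are hypothetical names, and plain `Seq[Array[Double]]` / `Seq[Double]` stand in for `DrmLike[K]`.

```scala
// Hypothetical sketch of the "fitter returns a model" pattern.
// Seq[Array[Double]] and Seq[Double] stand in for DrmLike[K]; no Mahout API is used.
object FitterSketch {

  // A fitted model only knows how to predict.
  trait Model extends Serializable {
    def predict(x: Seq[Array[Double]]): Seq[Double]
  }

  // A fitter is a strategy: data in, fitted model out, no mutable state.
  trait Fitter {
    def fit(x: Seq[Array[Double]], y: Seq[Double]): Model
  }

  // Trivial strategy used only to show the shape: always predict the mean target.
  object MeanFitter extends Fitter {
    def fit(x: Seq[Array[Double]], y: Seq[Double]): Model = {
      require(y.nonEmpty, "need at least one target value")
      val mean = y.sum / y.size
      new Model {
        def predict(rows: Seq[Array[Double]]): Seq[Double] = rows.map(_ => mean)
      }
    }
  }
}
```

Because both `Model` and `Fitter` are traits, an implementation remains free to mix in other traits or extend an unrelated base class.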
ASF GitHub Bot (JIRA)
2017-01-19 23:02:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830776#comment-15830776 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96979374

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+abstract class Regressor extends Model {
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

+1 on swapping Y and X; I've had to catch myself more than once on that already. I think the original motivation was a tip of the hat to R's `Y ~ x`, but I agree with you.

Re `[Int]` I realized that later, but haven't gone through to swap them all back to `K`. It is (or was) `K` in some places.
ASF GitHub Bot (JIRA)
2017-01-19 23:03:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830778#comment-15830778 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on the issue:

https://github.com/apache/mahout/pull/246
ASF GitHub Bot (JIRA)
2017-01-19 23:22:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830810#comment-15830810 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96982250

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+abstract class Regressor extends Model {
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

I guess if this is abstract enough, we also need to be able to admit hyperparameters, which are of course specific to every fitter. In R this is trivial (any call can take a bag of arbitrary named arguments), but in Scala this may need a bit of thought (if the abstraction needs to be that high). Otherwise, I guess most Scala kits just create a concrete fit signature per implementation.

If the Regressor trait is meant to be common to all possible regression-class algorithms, we either need a way to universally pass in the hyperparameters, or we should not have the fit abstraction in the Regressor trait at all. (Then what, I guess :) )
ASF GitHub Bot (JIRA)
2017-01-19 23:27:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830815#comment-15830815 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96982985

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+abstract class Regressor extends Model {
+
+ def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Perhaps something like `fit[K](X: DrmLike[K], T: DrmLike[K], hyperparameters: (Symbol, Any)*): Model`, where the optional trailing list is the hyperparameter list. Hyperparameterized calls could then look something like:
```
fit(X, Y, 'alpha -> 10.0, 'lambda -> 1.e-15)
```
etc.
This loses the strong typing of the signature, but the call is not that ugly, and it might be OK if specific implementations are cleanly documented.
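A minimal sketch of the varargs idea, assuming nothing about the eventual Mahout API: `HyperparameterSketch`, `Fitted` and the toy "shrunken mean" estimator are invented for illustration, and `Seq[Double]` stands in for a DRM.

```scala
// Hypothetical sketch of passing hyperparameters as (Symbol, Any) pairs.
// The estimator here is a toy; only the calling convention matters.
object HyperparameterSketch {

  final case class Fitted(value: Double, lambda: Double)

  // Shrunken mean: sum(y) / (n + lambda). lambda defaults to 0.0 when not passed.
  def fit(y: Seq[Double], hyperparameters: (Symbol, Any)*): Fitted = {
    val params = hyperparameters.toMap
    // Any-typed values must be checked at runtime; this is the lost type safety.
    val lambda = params.get(Symbol("lambda")) match {
      case Some(d: Double) => d
      case Some(other) =>
        throw new IllegalArgumentException(s"lambda must be a Double, got $other")
      case None => 0.0
    }
    Fitted(y.sum / (y.size + lambda), lambda)
  }
}
```

A call then reads `HyperparameterSketch.fit(Seq(2.0, 4.0), Symbol("lambda") -> 2.0)`. On Scala 2.12 and earlier the literal form `'lambda -> 2.0` is equivalent; symbol literals were deprecated in Scala 2.13, which is why the sketch spells out `Symbol("lambda")`.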
ASF GitHub Bot (JIRA)
2017-01-22 22:11:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833721#comment-15833721 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235509

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
+
+ var residuals: DrmLike[K] = _
+
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

Also, `drmFeatures` is somewhat confusing: the DRM is a matrix of samples x features, i.e. features are columns. I think something like `drmSamples`, `drmObservations`, or even `drmX` may be more straightforward.
ASF GitHub Bot (JIRA)
2017-01-22 22:11:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833722#comment-15833722 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235704

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/OrdinaryLeastSquares.scala ---
@@ -0,0 +1,96 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+import scala.reflect.ClassTag
+
+/**
+ * import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
+ * val model = new OrdinaryLeastSquares()
+ *
+ * model.calcStandardErrors = true
+ *
+ */
+class OrdinaryLeastSquares[K](hyperparameters: Map[String, Any] = Map("" -> None)) extends LinearRegressor[K] {
+ // https://en.wikipedia.org/wiki/Ordinary_least_squares
+
+ var calcStandardErrors: Boolean = hyperparameters.asInstanceOf[Map[String, Boolean]].getOrElse("calcStandardErrors", true)
+ var addIntercept: Boolean = hyperparameters.asInstanceOf[Map[String, Boolean]].getOrElse("addIntercept", true)
+
+ var summary = ""
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit = {
--- End diff --

Continuing on from our discussion on Slack, I would think that `fit` may be a more appropriate place for hyperparameters, e.g.:
```
fit(observed_independent: DrmLike[K], observed_targets: DrmLike[K], hyperparameters: Option[List[Double]]): List[Double]
```
I think that this may be a matter of convention, so if you're following a convention that I am not familiar with, this may be fine. However, I feel that this may be more robust.
ASF GitHub Bot (JIRA)
2017-01-22 22:11:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833723#comment-15833723 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235398

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
+
+ var residuals: DrmLike[K] = _
+
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

I would think that `fit(..)` should be able to return a list of errors per sample, so possibly `def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Any` would be a better signature for the trait.
ASF GitHub Bot (JIRA)
2017-01-22 22:15:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833728#comment-15833728 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235841

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/AutocorrelationTests.scala ---
@@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression.tests
+
+import org.apache.mahout.math.algorithms.regression.Regressor
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.function.Functions.SQUARE
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+
+object AutocorrelationTests {
+
+ //https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic
+ /*
+ To test for positive autocorrelation at significance α, the test statistic d is compared to lower and upper critical values (dL,α and dU,α):
+ If d < dL,α, there is statistical evidence that the error terms are positively autocorrelated.
+ If d > dU,α, there is no statistical evidence that the error terms are positively autocorrelated.
+ If dL,α < d < dU,α, the test is inconclusive.
+
+ Rule of Thumb:
+ d < 2 : positive auto-correlation
+ d = 2 : no auto-correlation
+ d > 2 : negative auto-correlation
+ */
+ def DurbinWaton[K](model: Regressor[K]): Regressor[K] = {
--- End diff --

Misspelling: this should be "DurbinWatson".
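For reference, the statistic this method is named after can be sketched on an in-memory residual series. The code below is not the pull request's implementation (which works on a fitted `Regressor`); `DurbinWatsonSketch` and `statistic` are hypothetical names, and the formula is the standard one, d = sum over t = 2..T of (e_t - e_{t-1})^2, divided by the sum over t = 1..T of e_t^2.

```scala
// Sketch of the Durbin-Watson statistic on a plain sequence of residuals.
// d = sum_{t=2..T} (e_t - e_{t-1})^2 / sum_{t=1..T} e_t^2
object DurbinWatsonSketch {
  def statistic(residuals: Seq[Double]): Double = {
    require(residuals.size >= 2, "need at least two residuals")
    // Pair each residual with its successor and sum the squared differences.
    val sumSquaredDiffs = residuals.zip(residuals.tail).map { case (prev, next) =>
      val diff = next - prev
      diff * diff
    }.sum
    val sumSquares = residuals.map(e => e * e).sum
    sumSquaredDiffs / sumSquares
  }
}
```

Residuals that alternate in sign push the value above 2 and residuals that barely change push it toward 0, which matches the rule of thumb quoted in the diff.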
ASF GitHub Bot (JIRA)
2017-01-22 22:17:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833730#comment-15833730 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235908

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/AsFactor.scala ---
@@ -0,0 +1,79 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import collection._
+import JavaConversions._
+import org.apache.mahout.math._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import scala.reflect.ClassTag
+
+class AsFactor extends Transformer{
+
+ var factorMap: MahoutVector = _
+ var k: MahoutVector = _
+ var summary = ""
+
+ def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] ={
+ if (!isFit) {
+ //throw an error
--- End diff --

Is this complete? I.e., do you want to throw an error here for this release?
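If the answer is yes, one minimal option might look like the sketch below. The choice of `IllegalStateException`, the helper name `requireFit`, and the toy `CenterSketch` class are all assumptions made for illustration; the pull request leaves the branch empty.

```scala
// Hypothetical sketch: fail loudly when transform is called before fit.
object FitGuardSketch {
  trait Transformer {
    protected var isFit: Boolean = false

    // Intended to be called at the top of transform / invTransform.
    protected def requireFit(): Unit =
      if (!isFit) throw new IllegalStateException("transform called before fit")
  }

  // Toy transformer that subtracts the mean learned during fit.
  final class CenterSketch extends Transformer {
    private var mean: Double = 0.0

    def fit(input: Seq[Double]): Unit = {
      mean = input.sum / input.size
      isFit = true
    }

    def transform(input: Seq[Double]): Seq[Double] = {
      requireFit()
      input.map(_ - mean)
    }
  }
}
```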
ASF GitHub Bot (JIRA)
2017-01-22 22:19:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833732#comment-15833732 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235959

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/MeanCenter.scala ---
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import collection._
+import JavaConversions._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.{Matrix, Vector}
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings.MahoutCollections._
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import scala.reflect.ClassTag
+
+class MeanCenter extends Transformer {
+
+ var summary = ""
+ var colMeansV: MahoutVector = _
+
+ /**
+ * Optionally set the centers of each column to some value other than Zero
+ * @param centers A vector of length equal to the `input` in the fit method specifying the
+ * centers to set each column to.
+ */
+ def setCenter(centers: MahoutVector) = {
+ colMeansV = colMeansV - centers
+ }
+
+ /**
+ * Centers Columns at zero
+ * @param input
+ */
+ def fit[K](input: DrmLike[K]) = {
+ colMeansV = input.colMeans
+ val colMeansA = colMeansV.toArray
+ //summary = (0 until colMeansA.length).map(i => s"Column ${i} mean: ${colMeansA(i)}").mkString(", ")
+ }
+
+ def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+ if (!isFit) {
+ //throw an error
+ }
+
+ implicit val ctx = input.context
+ val bcastV = drmBroadcast(colMeansV)
+
+ val output = input.mapBlock(input.ncol) {
+ case (keys, block) =>
+ val copy: Matrix = block.cloned
+ copy.foreach(row => row -= bcastV.value)
+ (keys, copy)
+ }
+ output
+ }
+
+ def invTransform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+ if (!isFit) {
+ //throw an error
--- End diff --

Same question here as before: do you want to throw an error here?
ASF GitHub Bot (JIRA)
2017-01-22 22:33:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833739#comment-15833739 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97236451

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/StandardScaler.scala ---
@@ -0,0 +1,119 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.drm
+
+import org.apache.mahout.math.scalabindings._
+
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.scalabindings.MatrixOps
+
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+
+import org.apache.mahout.math.Matrix
+
+import collection._
+import JavaConversions._
+
+import Math.sqrt
+
+import scala.reflect.{ClassTag,classTag}
+
+/**
+ * Scales columns to mean 0 and unit variance
+ */
+class StandardScaler extends Transformer{
+ var meanVec: MahoutVector = _
+ var variance: MahoutVector = _
+ var stdev: MahoutVector = _
+ var summary = ""
+
+ def fit[K](input: DrmLike[K]) = {
+ val mNv = dcolMeanVars(input)
+ meanVec = mNv._1
+ variance = mNv._2
+ stdev = mNv._2.sqrt
+ isFit = true
+
+ }
+
+ def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+ if (!isFit) {
+ //throw an error
--- End diff --

I find this a bit confusing (as well as the other mentions of `if (!isFit) { // throw }`). So if I want to standardize a DRM to (mean = 0, std_dev = 1), I would need to do something like:
```
drmStandardized = StandardScaler(unscaledDrm).fit().transform()
```
?
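
For readers following the thread, the fit-then-transform contract under discussion can be sketched roughly as below. This is only an illustration: `SimpleStandardScaler` is a hypothetical stand-in that works on plain arrays rather than a `DrmLike`, it uses the population standard deviation, and it is not the code from the pull request.

```scala
// Hypothetical stand-in for the scaler in the diff; plain arrays replace the DRM.
class SimpleStandardScaler {
  private var mean: Array[Double] = Array.empty
  private var stdev: Array[Double] = Array.empty
  private var isFit = false

  /** Learns the per-column mean and population standard deviation. */
  def fit(rows: Array[Array[Double]]): Unit = {
    require(rows.nonEmpty, "cannot fit on an empty input")
    val n = rows.length.toDouble
    val cols = rows.head.indices
    mean = cols.map(j => rows.map(_(j)).sum / n).toArray
    stdev = cols.map { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
    }.toArray
    isFit = true
  }

  /** Scales each column; fails loudly if fit() has not been called yet. */
  def transform(rows: Array[Array[Double]]): Array[Array[Double]] = {
    if (!isFit) throw new IllegalStateException("call fit() before transform()")
    // Note: a constant column has stdev 0 and is not guarded against here.
    rows.map(r => r.indices.map(j => (r(j) - mean(j)) / stdev(j)).toArray)
  }
}
```

Under these assumptions the caller fits once and can then transform any number of inputs with the learned statistics, which is why the two steps are separate.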
ASF GitHub Bot (JIRA)
2017-01-22 22:47:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833743#comment-15833743 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97236955

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/StandardScaler.scala ---
@@ -0,0 +1,119 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.drm
+
+import org.apache.mahout.math.scalabindings._
+
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.scalabindings.MatrixOps
+
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+
+import org.apache.mahout.math.Matrix
+
+import collection._
+import JavaConversions._
+
+import Math.sqrt
+
+import scala.reflect.{ClassTag,classTag}
+
+/**
+ * Scales columns to mean 0 and unit variance
+ */
+class StandardScaler extends Transformer{
+ var meanVec: MahoutVector = _
+ var variance: MahoutVector = _
+ var stdev: MahoutVector = _
+ var summary = ""
+
+ def fit[K](input: DrmLike[K]) = {
+ val mNv = dcolMeanVars(input)
+ meanVec = mNv._1
+ variance = mNv._2
+ stdev = mNv._2.sqrt
+ isFit = true
+
+ }
+
+ def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+ if (!isFit) {
+ //throw an error
--- End diff --

As well, I think that this could be another argument for moving hyperparameters into `fit(...)`. E.g., if for some reason we wanted to standardize to N(mean = 0, stdDev = 2), we could still call `StandardScaler` and `fit(Map("mu" -> 0, "sigma" -> 2))`:
```
val drmStandardized = StandardScaler(unscaledDrm).fit(Map("mu" -> 0, "sigma" -> 2)).transform()
```
ASF GitHub Bot (JIRA)
2017-01-22 22:49:27 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833744#comment-15833744 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97236994

--- Diff: math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuite.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.algorithms
+
+// arrange these proper
+import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.MahoutCollections._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.test.{DistributedMahoutSuite, MahoutSuite}
+import org.scalatest.{FunSuite, Matchers}
+
+trait RegressionSuite extends DistributedMahoutSuite with Matchers {
+ this: FunSuite =>
+
+ test("ordinary least squares") {
+ /*
+ R Prototype:
+ dataM <- matrix( c(2, 2, 10.5, 10, 29.509541,
+ 1, 2, 12, 12, 18.042851,
+ 1, 1, 12, 13, 22.736446,
+ 2, 1, 11, 13, 32.207582,
+ 1, 2, 12, 11, 21.871292,
+ 2, 1, 16, 8, 36.187559,
+ 6, 2, 17, 1, 50.764999,
+ 3, 2, 13, 7, 40.400208,
+ 3, 3, 13, 4, 45.811716), nrow=9, ncol=5, byrow=TRUE)
+
+
+ X = dataM[, c(1,2,3,4)]
+ y = dataM[, c(5)]
+
+ model <- lm(y ~ X - 1)
+ summary(model)
+
+ */
+
+ val drmData = drmParallelize(dense(
+ (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios
+ (1, 2, 12, 12, 18.042851), // Cap'n'Crunch
+ (1, 1, 12, 13, 22.736446), // Cocoa Puffs
+ (2, 1, 11, 13, 32.207582), // Froot Loops
+ (1, 2, 12, 11, 21.871292), // Honey Graham Ohs
+ (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold
+ (6, 2, 17, 1, 50.764999), // Cheerios
+ (3, 2, 13, 7, 40.400208), // Clusters
+ (3, 3, 13, 4, 45.811716)), numPartitions = 2)
+
+ drmData.collect(::, 0 until 4)
+
+ val drmX = drmData(::, 0 until 4)
+ val drmY = drmData(::, 4 until 5)
+
+ val model = new OrdinaryLeastSquares[Int]()
+ model.fit(drmY, drmX)
+ val estimate = model.beta
+ val Ranswers = dvec(-1.336265, -13.157702, -4.152654, -5.679908, 163.179329)
+
+ val epsilon = 1E-6
+ (estimate - Ranswers).sum should be < epsilon
+
+ }
+
--- End diff --

It would be good to have a couple more tests here; at least one for `Transform(...)`.
ASF GitHub Bot (JIRA)
2017-01-22 23:03:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833750#comment-15833750 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246

Trevor, this looks really good to me. I've left some comments, mainly about hyperparameters being moved to `fit(...)` from the model. I think that this makes sense in many ways, e.g., when doing a highly iterative hyperparameter search, it would eliminate a good amount of overhead to call:

```aModel.fit(....,HyperParameters: Map["hParameter1" -> value , "hParameter2" -> ...])```

rather than re-constructing the entire class each time. Also, as I noted inline, I think that the `fit(...)` method should have the ability to return at least a `List[Double]` of errors per row if needed, so I would suggest that it return `Any` rather than `Unit` in the base traits (unless the convention that you're following is to rely on predict for this).
ASF GitHub Bot (JIRA)
2017-01-22 23:08:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833751#comment-15833751 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246

I want to emphasize, though, that if you're building up to follow a convention (sklearn-like, MLlib, etc.) which I am not familiar with, that may be better to follow than my suggestions, to make this framework as easy on new users as possible.
ASF GitHub Bot (JIRA)
2017-01-22 23:27:27 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833758#comment-15833758 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on the issue:

https://github.com/apache/mahout/pull/246

Thanks for the review @andrewpalumbo !

[sklearn parameters are set when the `Estimator` is instantiated.](http://scikit-learn.org/stable/tutorial/statistical_inference/settings.html).

MLlib on the other hand, [passes parameter maps in `fit` as you suggest](https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/ml/Estimator.html#fit(org.apache.spark.sql.Dataset,%20org.apache.spark.ml.param.ParamMap))

Both, however, allow hyperparameters to be updated. And in the case you refer to, the model would not be re-instantiated; it would be something like this:

```scala
model.param1 = 1
model.fit(X, y)
model.param2 = 2
```

To your point, I also want to make this as easy as possible for new users, so I think it would be best to leave the option to pass a parameter map at instantiation, and also expose it as an optional parameter of the `fit` method.
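
A rough sketch of that compromise, for illustration only: `TunableModel`, `setHyperparameters`, and the `"iterations"` key with its default of 3 are assumptions made for this example, and the training step is a placeholder.

```scala
// Hypothetical sketch: hyperparameters can be given at construction
// and optionally overridden on each call to fit().
class TunableModel(initial: Map[String, Any] = Map.empty) {
  private var params: Map[String, Any] = initial
  var fitCount = 0

  /** Merges new values over the current ones; unknown keys are simply stored. */
  def setHyperparameters(updates: Map[String, Any]): Unit = {
    params = params ++ updates
  }

  /** Reads a typed value by pattern matching, falling back to a default. */
  def iterations: Int = params.get("iterations") match {
    case Some(n: Int) => n
    case _            => 3
  }

  def fit(x: Seq[Double],
          y: Seq[Double],
          hyperparameters: Map[String, Any] = Map.empty): Unit = {
    setHyperparameters(hyperparameters)
    // Placeholder for real training, which would loop `iterations` times.
    fitCount += 1
  }
}
```

With this shape a hyperparameter search can reuse one instance, calling `fit` repeatedly with a different map each time, while simple uses can ignore the map entirely.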
ASF GitHub Bot (JIRA)
2017-01-22 23:32:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833760#comment-15833760 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97238168

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
+
+ var residuals: DrmLike[K] = _
+
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

By errors, do you mean code errors or the residuals?
Post by Andrew Palumbo (JIRA)
Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms
------------------------------------------------------------------------------------------
Key: MAHOUT-1856
URL: https://issues.apache.org/jira/browse/MAHOUT-1856
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.12.1
Reporter: Andrew Palumbo
Assignee: Trevor Grant
Priority: Critical
Fix For: 0.13.0
To ensure that Mahout does not become "A loose bag of algorithms", Create basic traits with funtions common to each class of algorithm.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
ASF GitHub Bot (JIRA)
2017-01-22 23:33:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833761#comment-15833761 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97238201

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/StandardScaler.scala ---
@@ -0,0 +1,119 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.drm
+
+import org.apache.mahout.math.scalabindings._
+
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.scalabindings.MatrixOps
+
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+
+import org.apache.mahout.math.Matrix
+
+import collection._
+import JavaConversions._
+
+import Math.sqrt
+
+import scala.reflect.{ClassTag,classTag}
+
+/**
+ * Scales columns to mean 0 and unit variance
+ */
+class StandardScaler extends Transformer{
+ var meanVec: MahoutVector = _
+ var variance: MahoutVector = _
+ var stdev: MahoutVector = _
+ var summary = ""
+
+ def fit[K](input: DrmLike[K]) = {
+ val mNv = dcolMeanVars(input)
+ meanVec = mNv._1
+ variance = mNv._2
+ stdev = mNv._2.sqrt
+ isFit = true
+
+ }
+
+ def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+ if (!isFit) {
+ //throw an error
--- End diff --

correct and agreed- sklearn convention is to also have a `fitTransform` method to do it in one step.

will add that.
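
One way such a one-step method could look is sketched below (scikit-learn spells it `fit_transform`). The trait and class names are hypothetical, and plain Scala collections stand in for DRMs.

```scala
// Hypothetical trait; the default fitTransform simply chains the two steps.
trait SimpleTransformer {
  def fit(rows: Seq[Vector[Double]]): Unit
  def transform(rows: Seq[Vector[Double]]): Seq[Vector[Double]]

  def fitTransform(rows: Seq[Vector[Double]]): Seq[Vector[Double]] = {
    fit(rows)
    transform(rows)
  }
}

// Minimal example implementation: subtracts each column's mean.
class MeanCenterer extends SimpleTransformer {
  private var means: Option[Vector[Double]] = None

  def fit(rows: Seq[Vector[Double]]): Unit = {
    require(rows.nonEmpty, "cannot fit on an empty input")
    means = Some(rows.transpose.map(col => col.sum / col.size).toVector)
  }

  def transform(rows: Seq[Vector[Double]]): Seq[Vector[Double]] = {
    val m = means.getOrElse(
      throw new IllegalStateException("call fit() before transform()"))
    rows.map(r => r.zip(m).map { case (x, mu) => x - mu })
  }
}
```

Defining `fitTransform` once in the base trait means every transformer gets it without extra code, while `fit` and `transform` remain available separately.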
ASF GitHub Bot (JIRA)
2017-01-22 23:55:27 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833763#comment-15833763 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246

@rawkintrevo Great- best of both worlds then!
ASF GitHub Bot (JIRA)
2017-01-23 00:25:27 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833768#comment-15833768 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97239580

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
+
+ var residuals: DrmLike[K] = _
+
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

Yes, e.g. residuals, a Hessian matrix, etc.: anything that a user who is developing their own algorithm might like to see returned from their own `fit(..)` method. My point was that there is no reason to limit the return value to `Unit`; rather, I'd think `Any` would be a more appropriate return value from the base trait. Though maybe this would not work with the pipeline structure that you're setting up?
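
To make the trade-off concrete, here is a small sketch of a base trait whose `fit` returns `Any`. The names `AnyFitModel` and `MeanModel` are invented for this example and do not come from the pull request.

```scala
// Hypothetical base trait: fit may return whatever the implementer finds useful.
trait AnyFitModel {
  def fit(x: Seq[Double], y: Seq[Double]): Any
}

// An implementation may narrow the return type; here fit returns the residuals
// of a trivial "predict the mean" model.
class MeanModel extends AnyFitModel {
  var mean = 0.0

  def fit(x: Seq[Double], y: Seq[Double]): Seq[Double] = {
    require(y.nonEmpty, "cannot fit on an empty target")
    mean = y.sum / y.size
    y.map(_ - mean)
  }
}
```

Code that holds a `MeanModel` sees `Seq[Double]` directly, but code written against the base trait (a pipeline, for instance) only sees `Any` and has to pattern match on the result, which is the concern raised at the end of the comment.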
ASF GitHub Bot (JIRA)
2017-01-24 20:50:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836619#comment-15836619 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97647112

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcutt.scala ---
@@ -0,0 +1,83 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.drm.CacheHint
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+class CochraneOrcutt[K](hyperparameters: Map[String, Any] = Map("" -> None)) extends LinearRegressor[K] {
+ // https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation
+
+ var regressor: LinearRegressor[K] = hyperparameters.asInstanceOf[Map[String, LinearRegressor[K]]].getOrElse("regressor", new OrdinaryLeastSquares())
+ var iterations: Int = hyperparameters.asInstanceOf[Map[String, Int]].getOrElse("iterations", 3)
+ var cacheHint: CacheHint.CacheHint = hyperparameters.asInstanceOf[Map[String, CacheHint.CacheHint]].getOrElse("cacheHint", CacheHint.MEMORY_ONLY)
+ // For larger inputs, CacheHint.MEMORY_AND_DISK2 is reccomended.
+
+ var betas: Array[MahoutVector] = _
+
+ var summary = ""
+
+ def fit(drmFeatures: DrmLike[K],
+ drmTarget: DrmLike[K],
+ hyperparameters: Map[String, Any] = Map("" -> None)): Unit = {
+
+ var hyperparameters: Option[Map[String,Any]] = None
--- End diff --

there should be a `setHyperparameters` right about here...
ASF GitHub Bot (JIRA)
2017-01-24 21:07:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836633#comment-15836633 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on the issue:

https://github.com/apache/mahout/pull/246

If not too much hassle, please consider using the Unicode characters →, ⇒, ←.
In IntelliJ this is easily facilitated by adding substitution 'live' templates.
ASF GitHub Bot (JIRA)
2017-01-24 21:08:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836635#comment-15836635 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97647259

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,49 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
--- End diff --

Regressor still extends model? Regressor's fit IMO should be a `factory method` w.r.t. model instead
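
A minimal sketch of what a factory-style `fit` might look like, assuming a one-feature least-squares fit as a stand-in for a real regressor; `LinearModel` and `SimpleLinearRegressor` are hypothetical names, not types from the pull request.

```scala
// The fitted model is an immutable value that only knows how to predict.
final case class LinearModel(intercept: Double, slope: Double) {
  def predict(x: Double): Double = intercept + slope * x
}

// The regressor holds no fitted state; fit() is a factory for models.
object SimpleLinearRegressor {
  def fit(x: Seq[Double], y: Seq[Double]): LinearModel = {
    require(x.size == y.size && x.size >= 2, "need at least two paired points")
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val sxx = x.map(xi => (xi - mx) * (xi - mx)).sum
    require(sxx > 0, "x must not be constant")
    val sxy = x.zip(y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum
    val slope = sxy / sxx
    LinearModel(my - slope * mx, slope)
  }
}
```

Because the algorithm and its result are separate types, there is no unfitted model to guard against, so checks such as `if (!isFit)` are not needed in this design.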
ASF GitHub Bot (JIRA)
2017-01-24 21:08:26 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836634#comment-15836634 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97646707

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,49 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
+
+ var residuals: DrmLike[K] = _
+
+ var drmY: DrmLike[K] = _
+
+ def fit(drmFeatures: DrmLike[K],
+ drmTarget: DrmLike[K],
+ hyperparameters: Map[String,Any] = Map("" -> None)): Unit
--- End diff --

Just in case it was not quite clear, my suggestion was to use `hyperparameters: (Symbol, Any)*`. First, symbols are faster for something that is meant to be an id, and second, they have a somewhat more palatable notation, e.g.,
```
val model = fit(X,Y,'k -> 10, 'alpha -> 1e-5)
```
and the signature overall
```
def fit(drmFeatures: DrmLike[K],
        drmTarget: DrmLike[K],
        hyperparameters: (Symbol, Any)*): Model
```
Of course, the implementation can easily get a map, should it need one:
```
val hmap = hyperparameters.toMap
```
That's actually a Scala pattern I developed and used in a similar situation.
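
As a rough illustration of the pattern described above, here is a minimal, self-contained sketch. `FittedModel`, the default values, and the parameter names are hypothetical stand-ins rather than Mahout classes, and `Symbol("k")` is used in place of the `'k` literal only because newer Scala versions no longer accept that literal syntax.

```scala
object HyperparameterSketch {
  // Hypothetical stand-in for a fitted model; not a Mahout class.
  final case class FittedModel(k: Int, alpha: Double)

  // Varargs of (Symbol, Any) pairs, as proposed in the comment above.
  def fit(hyperparameters: (Symbol, Any)*): FittedModel = {
    val hmap: Map[Symbol, Any] = hyperparameters.toMap

    // A type pattern recovers a typed value and falls back to an
    // (assumed) default when the key is absent or has another type.
    val k = hmap.get(Symbol("k")) match {
      case Some(v: Int) => v
      case _            => 5
    }
    val alpha = hmap.get(Symbol("alpha")) match {
      case Some(v: Double) => v
      case _               => 1e-3
    }
    FittedModel(k, alpha)
  }
}
```

Matching on the value's type avoids casting the whole map with `asInstanceOf`, which would not check anything until a value is actually read.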
ASF GitHub Bot (JIRA)
2017-01-24 21:08:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836636#comment-15836636 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97649184

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/AutocorrelationTests.scala ---
@@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression.tests
+
+import org.apache.mahout.math.algorithms.regression.Regressor
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.function.Functions.SQUARE
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+
+object AutocorrelationTests {
+
+ //https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic
+ /*
+ To test for positive autocorrelation at significance α, the test statistic d is compared to lower and upper critical values (dL,α and dU,α):
+ If d < dL,α, there is statistical evidence that the error terms are positively autocorrelated.
+ If d > dU,α, there is no statistical evidence that the error terms are positively autocorrelated.
+ If dL,α < d < dU,α, the test is inconclusive.
+
+ Rule of Thumb:
+ d < 2 : positive auto-correlation
+ d = 2 : no auto-correlation
+ d > 2 : negative auto-correlation
+ */
+ def DurbinWatson[K](model: Regressor[K]): Regressor[K] = {
+ val e: DrmLike[K] = model.residuals(1 until model.residuals.nrow.toInt, 0 until 1)
--- End diff --

`nrow.toInt` is generally dangerous because it does not catch the algorithm's limitation. The problem with this (and I have actually run into it before) is that the algorithm obviously has a limit of about 2 billion rows in this case, and it should explicitly fall apart once that limit is reached instead of silently producing nonsense. I think there is a method specifically for this purpose in the `drm` package of math-scala, `safeToNonNegInt`, that would throw an IllegalArgument exception if the conversion loses significant bits.

Of course the best approach is to avoid such limitations in the first place, but if unavoidable, please use `safeToNonNegInt`.
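
To make the failure mode concrete, here is a small self-contained sketch. `checkedToNonNegInt` is a hypothetical helper written for illustration; it is not Mahout's `safeToNonNegInt`, whose actual implementation and exception type may differ.

```scala
object CheckedNarrowing {
  // Hypothetical helper illustrating the idea: refuse to narrow a Long
  // that does not fit into a non-negative Int instead of wrapping around.
  def checkedToNonNegInt(x: Long): Int = {
    require(x >= 0L && x <= Int.MaxValue.toLong,
      s"Value $x cannot be represented as a non-negative Int")
    x.toInt
  }

  def main(args: Array[String]): Unit = {
    val nrow = 3000000000L // more rows than an Int can index

    // Plain narrowing silently wraps to a negative number.
    println(nrow.toInt)

    // The checked variant fails loudly with IllegalArgumentException.
    println(scala.util.Try(checkedToNonNegInt(nrow)).isFailure)
  }
}
```

With plain `toInt`, a range such as `1 until nrow.toInt` would be built from the wrapped value and simply be empty, which is how the nonsense result stays silent.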
ASF GitHub Bot (JIRA)
2017-01-25 05:22:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837211#comment-15837211 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97711337

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
+
+ var residuals: DrmLike[K] = _
+
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

@andrewpalumbo and @dlyubimov, re `drmFeatures` vs `drmX`:
We're mixing conventions.
For consistency, I feel we should use either `drmFeatures` and `drmTarget` (similar to sparkML) or `drmX` and `drmY` (similar to sklearn and R). Leaving it as is for now, but open to debate. I have a slight bias towards `drmX` and `drmY`.
ASF GitHub Bot (JIRA)
2017-01-25 06:17:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837256#comment-15837256 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97715579

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,49 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
--- End diff --

I agree that Spark and Flink follow the paradigm you are suggesting; however, sklearn doesn't. If we're just going off of what others do, then yes, we should probably follow the conventions of the other, larger Scala-based "Big Data" packages. However, I can't understand WHY they do it that way. It makes the code hard to read and follow, and I assume it is an artifact of all the serialization and of the way they execute models (having to ship objects around for map / reduce phases). That is to say, they do it because _they are forced to_ and _at the expense of_ readability.

In Mahout, most of that is taken care of at the distributed engine level.

If we start going down the rabbit hole of "do as Spark and Flink do", we may find ourselves with an [entire class just for the summary of a linear model](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L620). I, for one, want to stay as far away from that as possible. I'd like to see algorithm code (these and future) be as succinct and tractable as possible so that:

1. New contributors aren't intimidated (that is to say, they are encouraged to commit algorithms)
2. Those algorithms can be easily reviewed and maintained with minimal Scala knowledge (requiring more would limit the pool of willing and able contributors who understand the actual math at play)

That isn't to say that, at the end of the day, your proposal is incorrect. You usually are correct, and I value and appreciate you taking the time to review. I am saying that "i think that's how this pattern goes in most kits." is neither necessary nor sufficient IMO. In some respects I'm explicitly trying to avoid the approach of other packages, which in this case means refactoring something to be more complex with no clear understanding of the benefit.
ASF GitHub Bot (JIRA)
2017-01-25 15:51:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837948#comment-15837948 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97808310

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/FittnessTests.scala ---
@@ -0,0 +1,52 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression.tests
+
+import org.apache.mahout.math.algorithms.regression.Regressor
+import org.apache.mahout.math.algorithms.transformer.MeanCenter
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.function.Functions.SQUARE
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+import scala.reflect.ClassTag
+
+object FittnessTests {
+
+ // https://en.wikipedia.org/wiki/Coefficient_of_determination
+ def CoefficientOfDetermination[R[K] <: Regressor[K], K](model: R[K],
+ drmTarget: DrmLike[K]): R[K] = {
+ val sumSquareResiduals = model.residuals.assign(SQUARE).sum
+ val mc = new MeanCenter()
+ val totalResiduals = mc.fitTransform(drmTarget)
+ val sumSquareTotal = totalResiduals.assign(SQUARE).sum
+ val r2 = 1 - (sumSquareResiduals / sumSquareTotal)
+ model.testResults += ("r2" -> r2)
+ model.summary += s"\nR^2: ${r2}"
+ model
+ }
+
+ // https://en.wikipedia.org/wiki/Mean_squared_error
+ def MeanSquareError[R[K] <: Regressor[K], K](model: R[K]): R[K] = {
+ val mse = model.residuals.assign(SQUARE).sum / model.residuals.nrow
--- End diff --

Will update this to use `safeToNonNegInt(...)`.
ASF GitHub Bot (JIRA)
2017-01-25 15:54:26 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837955#comment-15837955 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97809121

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcutt.scala ---
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.drm.CacheHint
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+class CochraneOrcutt[K](hyperparameters: (Symbol, Any)*) extends LinearRegressor[K] {
+ // https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation
+
+ var regressor: LinearRegressor[K] = hyperparameters.asInstanceOf[Map[Symbol, LinearRegressor[K]]].getOrElse('regressor, new OrdinaryLeastSquares())
+ var iterations: Int = hyperparameters.asInstanceOf[Map[Symbol, Int]].getOrElse('iterations, 3)
+ var cacheHint: CacheHint.CacheHint = hyperparameters.asInstanceOf[Map[Symbol, CacheHint.CacheHint]].getOrElse('cacheHint, CacheHint.MEMORY_ONLY)
+ // For larger inputs, CacheHint.MEMORY_AND_DISK2 is reccomended.
+
+ var betas: Array[MahoutVector] = _
+
+ var summary = ""
+
+ setHyperparameters(hyperparameters.toMap)
+
+ def setHyperparameters(hyperparameters: Map[Symbol, Any] = Map('foo -> None)): Unit = {
+ regressor = hyperparameters.asInstanceOf[Map[Symbol, LinearRegressor[K]]].getOrElse('regressor, new OrdinaryLeastSquares())
+ iterations = hyperparameters.asInstanceOf[Map[Symbol, Int]].getOrElse('iterations, 3)
+ cacheHint = hyperparameters.asInstanceOf[Map[Symbol, CacheHint.CacheHint]].getOrElse('cacheHint, CacheHint.MEMORY_ONLY)
+ }
+
+ def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K], hyperparameters: (Symbol, Any)*): Unit = {
+
+ var hyperparameters: Option[Map[String,Any]] = None
+ betas = new Array[MahoutVector](iterations)
+ regressor.fit(drmFeatures, drmTarget)
+ betas(0) = regressor.beta
+
+ drmY = drmTarget
+
+ val Y = drmTarget(1 until drmTarget.nrow.toInt, 0 until 1).checkpoint(cacheHint)
+ val Y_lag = drmTarget(0 until drmTarget.nrow.toInt - 1, 0 until 1).checkpoint(cacheHint)
+ val X = drmFeatures(1 until drmFeatures.nrow.toInt, 0 until 1).checkpoint(cacheHint)
+ val X_lag = drmFeatures(0 until drmFeatures.nrow.toInt - 1, 0 until 1).checkpoint(cacheHint)
--- End diff --

Missed all of these, but have since updated them to use `safeToNonNegInt(...)`.
ASF GitHub Bot (JIRA)
2017-01-27 01:36:24 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840810#comment-15840810 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r98131529

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala ---
@@ -0,0 +1,49 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+ * Abstract of Regressors
+ */
+trait Regressor[K] extends Model {
--- End diff --

In most kits I know (including two I am working on myself right now), the pattern is that `fit` returns a model object. scikit seems to be the outlier on this.

Also, I think the approach "package A does it like this and we do/don't like it, therefore it is good/bad" is a dogma fallacy. IMO we just need to do what makes sense.

And it makes sense to me to serialize or persist the model, not the (fitting algorithm + model). Persisting the latter will cause problems on both the user and implementation ends, IMO.
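
A minimal sketch of the separation argued for here, under stated assumptions: `LineModel` and `LeastSquaresFitter` are hypothetical names, plain `Seq[Double]` stands in for a DRM, and a one-variable least-squares fit stands in for a real regressor. It is not the code from the pull request.

```scala
object FitterModelSketch {
  // The fitted state only: this is what one would serialize or persist.
  final case class LineModel(slope: Double, intercept: Double) {
    def predict(x: Double): Double = slope * x + intercept
  }

  // The fitting algorithm: it keeps no fitted state and acts as a
  // factory for LineModel.
  object LeastSquaresFitter {
    def fit(xs: Seq[Double], ys: Seq[Double]): LineModel = {
      require(xs.size == ys.size && xs.size >= 2, "need at least two paired points")
      val n = xs.size.toDouble
      val meanX = xs.sum / n
      val meanY = ys.sum / n
      // Slope is the sample covariance of x and y over the variance of x.
      val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
      require(sxx != 0.0, "x values must not all be identical")
      val slope = sxy / sxx
      LineModel(slope, meanY - slope * meanX)
    }
  }
}
```

Because `LineModel` is an immutable case class holding only coefficients, it can be stored or shipped without dragging the fitter's configuration or intermediate data along with it.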
ASF GitHub Bot (JIRA)
2017-02-01 03:24:52 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847923#comment-15847923 ]

ASF GitHub Bot commented on MAHOUT-1856:
----------------------------------------

Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/246
Hudson (JIRA)
2017-02-01 03:46:51 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847936#comment-15847936 ]

Hudson commented on MAHOUT-1856:
--------------------------------

FAILURE: Integrated in Jenkins build Mahout-Quality #3412 (See [https://builds.apache.org/job/Mahout-Quality/3412/])
MAHOUT-1856 Add Framework for Models, Fitters, and Tests closes (rawkintrevo: rev 9a31923eae3727d9d91bd2c2ed8df12a616a577e)
* (add) spark/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/AutocorrelationTests.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/OrdinaryLeastSquaresModel.scala
* (add) math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuiteBase.scala
* (edit) .gitignore
* (add) flink/src/test/scala/org/apache/mahout/flinkbindings/standard/RegressionSuite.scala
* (add) math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuiteBase.scala
* (add) spark/src/test/scala/org/apache/mahout/math/algorithms/PreprocessorSuite.scala
* (add) math-scala/src/test/scala/org/apache/mahout/math/algorithms/PreprocessorSuiteBase.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/PreprocessorModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/SupervisedFitter.scala
* (add) flink/src/test/scala/org/apache/mahout/flinkbindings/standard/PreprocessorSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/Fitter.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/UnsupervisedFitter.scala
* (add) h2o/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/StandardScaler.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/FittnessTests.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcuttModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/Model.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/RegressorModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/UnsupervisedModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/LinearRegressorModel.scala
* (add) h2o/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuite.scala
* (add) spark/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/MeanCenter.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/SupervisedModel.scala
* (add) flink/src/test/scala/org/apache/mahout/flinkbindings/standard/RegressionTestsSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/AsFactor.scala
* (add) h2o/src/test/scala/org/apache/mahout/math/algorithms/PreprocessorSuite.scala
Andrew Palumbo (JIRA)
2017-02-01 21:10:51 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848930#comment-15848930 ]

Andrew Palumbo commented on MAHOUT-1856:
----------------------------------------

[~rawkintrevo] can this be marked as resolved? Or is there still more to do here for 0.13.0?
Trevor Grant (JIRA)
2017-02-01 21:39:51 UTC
[ https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trevor Grant resolved MAHOUT-1856.
----------------------------------
Resolution: Implemented