Discussion:
Few Questions regarding Mahout
Aditya
2017-03-31 23:47:51 UTC
Hi everyone,

I've been talking with Trevor over email and he shared some documents with
me. They contained content that he (along with a few others) was
developing to make Mahout more accessible to newbies like myself.

I've gone through the planned blog posts titled "Why Mahout", "Getting
Started with Mahout", "Algorithms Framework" and "Building Apache Mahout
from Source", and I have to say, I've got a lot of questions. Since Trevor
is on vacation and the deadline for final proposal submission is fast
approaching, I thought I'd post my questions on the dev forum.

So here goes the big list of my questions. I hope those of you who were
or are involved in the development of these blog posts will be able to help
me. Some of the questions are vague or abstract, so please answer them
as if you were explaining to a layman.

1. Could you explain the high-level structure of Mahout?

2. What plans are in the pipeline for Mahout's development in the coming
months?

3. How does contributing a new algorithm work in Mahout? When I was
reading the doc "Getting Started with Mahout", the example implemented
Ordinary Least Squares regression in Samsara, Mahout's DSL.
I had something different in mind before reading the blog posts. I had
thought that I would be contributing a distributed algorithm to Mahout
from scratch, written in Scala, and making it available as a package that
Mahout users can import and use.

4. In general, is there a plan for future algorithms to be contributed in
Samsara only? If so, what will be the limitations and advantages of this
decision? In other words, is there a plan to write all of the algorithms
that become part of Mahout in the future in Samsara?

5. What are the building blocks of Mahout that enable distributed
processing? The blog post mentions the Distributed Row Matrix. Are there
any other distributed data structures available? If not, won't that limit
which algorithms can become part of the Mahout framework in the future?
I mean algorithms that cannot be reduced to a linear algebra problem.

6. What is expected of a newbie in the community? What is the learning
curve to become an active contributor to Mahout? Are there any specific
books / blog posts that I can read that will make the process easier?

7. Also, could you give me some background on how the development of
Mahout has been going? Not the motivation or inspiration that led to
Mahout's conception, but something like what work has gone on between the
previous release and the current release candidate.

8. What was the high-level motivation for developing Mahout's own DSL,
Samsara?

Regards,
Aditya
Aditya
2017-04-02 09:06:26 UTC
Hello again,

I hope most of you have had time to read through the previous mail. It would
mean a lot if you could answer the above questions, at least in part.

Thanks,
Aditya
dustin vanstee
2017-04-03 20:41:12 UTC
Hi Aditya, I am new to the project myself, so I can't comment on all your
questions, but here are a few comments I have for you.

1. High level structure of Mahout
Trevor gave a presentation at a meetup that had a nice architecture diagram
that shows the layers.

Mainly, it's about using the Samsara DSL to write backend-agnostic
algorithms, then letting Mahout do the mapping and optimizations for
whichever backend you are using.

[image: Inline image 1]
3. How does contributing a new algorithm work in Mahout? When I was
reading the doc "Getting Started with Mahout", the example implemented
Ordinary Least Squares regression in Samsara, Mahout's DSL.
I had something different in mind before reading the blog posts. I had
thought that I would be contributing a distributed algorithm to Mahout
from scratch, written in Scala, and making it available as a package that
Mahout users can import and use.

I think the idea is to let the backend engine figure out how best to
distribute the work. That said, when writing a binding to a particular
backend, a lot of work probably goes into finding the best way to
represent a DRM.
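
To make the OLS example from question 3 a little more concrete, here is a
minimal sketch of one common way to compute it, the normal equations
beta = (X^T X)^-1 X^T y. This is plain Python, not Samsara or Mahout code,
and the thread does not show which approach the blog post actually takes.
Every function name below is hypothetical, and the small in-memory lists
only stand in for what would be a distributed matrix. The point is just that
the algorithm is written as a few linear algebra steps, with nothing in it
that says how those steps get executed.

```python
# Sketch only: plain Python lists, not Mahout/Samsara. All names are hypothetical.

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a * x = b for square a using Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    # Build the augmented matrix [a | b].
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(x, y):
    """Ordinary least squares via the normal equations."""
    xt = transpose(x)
    xtx = matmul(xt, x)                                      # X^T X
    xty = [row[0] for row in matmul(xt, [[v] for v in y])]   # X^T y
    return solve(xtx, xty)


# Toy data generated from y = 1 + 2 * x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = ols(X, y)
print(beta)
```

On this toy data the fit should recover an intercept close to 1 and a slope
close to 2. In a backend-agnostic setting, the two products inside ols would
be the distributed part, while the final small solve could run in memory.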

4. In general, is there a plan for future algorithms to be contributed in
Samsara only? If so, what will be the limitations and advantages of this
decision? In other words, is there a plan to write all of the algorithms
that become part of Mahout in the future in Samsara?

I think that's where the sweet spot is: backend-agnostic code.


6. What is expected of a newbie in the community? What is the learning
curve to become an active contributor to Mahout? Are there any specific
books / blog posts that I can read that will make the process easier?

As a newbie, I think it's participating in the building and testing of code
releases, and also working on some simple JIRAs. In my experience,
working on my first JIRA is helping me get more familiar with some small
aspects of the overall project. I think you will need to get good with
IntelliJ to help you read, write, and test code. I perused Trevor's
documents and all the write-ups on the Mahout website. Beyond that, just
trying things in code will help.


Sorry, I don't have tons of answers myself, but this is what I have found
out so far. Hope that helps.
Trevor Grant
2017-04-04 15:05:43 UTC
Good questions, Aditya, and awesome response, Dustin et al.

I'm back in, and trying to work my way through emails I missed while out.

The Meetup presentation referenced is available in full here.
https://github.com/rawkintrevo/presentations/blob/master/Mahout%20Whats%20Next%20DFW%20Meetup.pdf

Hopefully that will be a somewhat useful "structure" overview.

To all watching, the write-ups I have mentioned are a series of blog posts
I intend to push out ASAP, aimed specifically at new users (to Aditya's
point number 6). At the moment they are incomplete, poorly edited, unclear,
and possibly incorrect in spots. I promise to publish them once they are
clean!

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*