Discussion:
[NEW member] Hi
Debojyoti Dutta
2016-05-18 00:10:35 UTC
Permalink
Hi there,

Am very interested in contributing to Mahout especially towards fast ML
kernels that can be used for streaming. Have some experience with LSH based
techniques (including hw accel) for clustering and near neighbors based
stuff in general.

Was chatting with Sunil and he suggested I join the merry band.

regards
-Debo~
Debojyoti Dutta
2016-05-18 00:35:04 UTC
Permalink
Thanks Andy! Would like to see if there is interest for algorithms such as
1) clustering text in an online fashion (maybe using LSH or sim/min hash)
or 2) online clustering of time series. Basically my focus is "online" or
real time.

LSH on GPU sounds very interesting and would love to look at the patches.
Personally have helped accelerate LSH on TCAMs long ago e.g.
http://arxiv.org/abs/1006.3514 .... Is GPU the only hw accel you are
looking at or are you considering PCIe FPGA cards too?

debo
Welcome, Debojyoti.
We look forward to your contributions. We are currently working towards
integrating GPU acceleration for our 0.13 release, and LSH sounds like a
great addition. Could you tell us some more about what you would like to do?
Let us know if we can help you get familiar with the Mahout code base. We
try to implement algorithms in the math-scala module.
Thanks,
Andy
Suneel Marthi
2016-05-28 13:50:19 UTC
Permalink
Debo,
We are certainly interested in online clustering algorithms, and
clustering of time series seems like a great fit. (Our text vectorization
pipeline has not yet been reworked for the new Mahout "Samsara", but that is
of interest too.) What type of compute platform would you require for this?
For the data processing pipeline, the requirements are:
(a) it should be agnostic to the distributed processing engine (Spark,
Flink, etc.);
(b) it should be able to scale data pipelines and support back
pressure;
(c) it should be able to ingest both batch and streaming data from Spark,
Flink, Beam, etc.

So far Apache NiFi seems to fit the bill for all of the above criteria
(they don't have a Beam interface yet, but it is being worked on), and they
also have an excellent GUI along with features to define common workflow
templates that could be imported into custom workflows.

The other alternatives being considered are Airbnb's Airflow (proposed for
the Apache Incubator; it defines workflows as a DAG in Python) and
Apache Beam.
Currently we are not looking at FPGAs.
If any of the math packages handle FPGAs natively out of the box, let's go
for it. But we need not optimize the heck out of it to get the last bit of
performance from FPGAs.
The most recent, and only real, documentation for Mahout Samsara is in
http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html.
You may want to check that out as a reference.
(I'm sorry for the shameless plug, but it is the only thing that covers
almost all Mahout "Samsara" features and architecture up to our previous release.)
I don't see this as a shameless plug; it's definitely much better than the
dozen low-grade books that have been churned out by Packt publishers and
went nowhere, other than bringing disrepute to the project and community.
Please do let us know if you have any questions about the Samsara platform.
Khurrum Nasim
2016-06-01 14:48:51 UTC
Permalink
How are you folks getting over the learning curves associated with things like NiFi and Airflow?
Suneel Marthi
2016-06-01 15:01:04 UTC
Permalink
Was that question directed to the community, or were you just thinking out loud?
Khurrum Nasim
2016-06-01 15:03:57 UTC
Permalink
To the community, active committers, etc.