Debo,
We are certainly interested in online clustering algorithms, and
clustering of time series seems like a great fit. (Our text vectorization
pipeline has not yet been reworked for the new Mahout "Samsara", but that is
of interest too.) What type of compute platform would you require for this?
For the data processing pipeline, the requirements are:
(a) it should be agnostic to any distributed processing engine, such as
Spark, Flink, etc.
(b) it should be able to scale data pipelines and support back
pressure.
(c) it should be able to ingest both batch and streaming data from Spark,
Flink, Beam, etc.
So far Apache NiFi seems to fit the bill for all of the above criteria
(it doesn't have a Beam interface yet, but one is being worked on). It also
has an excellent GUI, along with features for defining common workflow
templates that can be imported into custom workflows.
The other alternatives being considered are Airbnb's Airflow (proposed for
the Apache Incubator; it defines workflows as DAGs in Python) and
Apache Beam.
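To make requirement (b) concrete, here is a minimal sketch of back pressure using only the Python standard library. It is not how NiFi, Airflow, or Beam implement it; the function name `run_pipeline` and the `capacity` parameter are made up for illustration. The idea is simply that a bounded buffer makes a fast producer wait for a slow consumer instead of piling up data in memory.

```python
import queue
import threading


def run_pipeline(items, capacity=4):
    """Push items through a bounded buffer; a full buffer blocks the producer."""
    # The maxsize bound is what creates back pressure (hypothetical capacity).
    buffer = queue.Queue(maxsize=capacity)
    sentinel = object()  # marks the end of the stream
    results = []
    max_depth = 0

    def producer():
        for item in items:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    def consumer():
        nonlocal max_depth
        while True:
            max_depth = max(max_depth, buffer.qsize())
            item = buffer.get()
            if item is sentinel:
                break
            results.append(item * 2)  # stand-in for real processing

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, max_depth
```

Calling `run_pipeline(range(100), capacity=4)` returns the processed items in order, and the observed buffer depth never exceeds the capacity, however fast the producer runs.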
Currently we are not looking at FPGAs.
If any of the math packages handle FPGAs natively out of the box, let's go
for it. But we need not optimize the heck out of it to get the last bit of
performance from FPGAs.
The most recent, and only real, documentation for Mahout Samsara is at
http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html.
You may want to check that out as a reference.
(I'm sorry for the shameless plug, but it is the only thing that covers almost
all Mahout "Samsara" features and architecture up to our previous release.)
I don't see this as a shameless plug; it's definitely much better than the
dozen low-grade books that have been churned out by Packt Publishing and
went nowhere, other than bringing disrepute to the project and community.
Please do let us know if you have any questions about the Samsara platform.
________________________________________
Sent: Tuesday, May 17, 2016 8:35:04 PM
Subject: Re: [NEW member] Hi
Thanks Andy! Would like to see if there is interest for algorithms such as
1) clustering text in an online fashion (maybe using LSH or sim/min hash)
or 2) online clustering of time series. Basically my focus is "online" or
real time.
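For concreteness, option 1 could look roughly like the sketch below: MinHash signatures over word shingles, with LSH banding to assign each incoming document to a cluster as it arrives. This is only an illustration under my own assumptions, not Mahout code and not a proposed design. The names `OnlineLSHClusterer`, `shingles` and `minhash_signature`, and the parameters (32 hashes, 16 bands, 3-word shingles), are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (whole text if shorter)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, s):
    """Seeded 64-bit hash; the seed selects one hash function of the family."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=32):
    """MinHash signature: the minimum hash of the set under each seed."""
    return tuple(min(_hash(seed, s) for s in items)
                 for seed in range(num_hashes))


class OnlineLSHClusterer:
    """Assigns each document to a cluster in one pass using LSH banding."""

    def __init__(self, num_hashes=32, bands=16):
        assert num_hashes % bands == 0
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = {}  # (band index, band values) -> cluster id
        self.next_id = 0

    def add(self, text):
        sig = minhash_signature(shingles(text), self.num_hashes)
        keys = [(b, sig[b * self.rows:(b + 1) * self.rows])
                for b in range(self.bands)]
        # Join the first cluster that shares any band with this document.
        cluster = next((self.buckets[k] for k in keys if k in self.buckets),
                       None)
        if cluster is None:
            cluster = self.next_id
            self.next_id += 1
        for k in keys:
            self.buckets.setdefault(k, cluster)
        return cluster
```

Each `add` call costs one signature computation plus a handful of dictionary lookups, so documents can be clustered as they stream in. The band and row counts trade recall against precision and would need tuning on real data.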
LSH on GPU sounds very interesting and would love to look at the patches.
Personally have helped accelerate LSH on TCAMs long ago e.g.
http://arxiv.org/abs/1006.3514 .... Is GPU the only hw accel you are
looking at or are you considering PCIe FPGA cards too?
debo
Welcome, Debojyoti.
We look forward to your contributions. We are currently working towards
integrating GPU acceleration for our 0.13 release and LSH sounds like a
great addition. Could you tell us some more about what you would like to
do?
Let us know if we can help you get familiar with the Mahout code base.
We try to implement algorithms in the math-scala module.
Thanks,
Andy
-------- Original message --------
Date: 05/17/2016 8:11 PM (GMT-05:00)
Subject: [NEW member] Hi
Hi there,
Am very interested in contributing to Mahout especially towards fast ML
kernels that can be used for streaming. Have some experience with LSH
based techniques (including hw accel) for clustering and near neighbors
based stuff in general.
Was chatting with Sunil and he suggested I join the merry band.
regards
-Debo~