Discussion:
stochastic nature
Khurrum Nasim
2016-05-02 14:49:19 UTC
Hey All,

I’d like to know if Mahout uses any randomized algorithms. I’m thinking it probably does. Can somebody point me to the packages that use randomized algorithms?

Thanks,

Khurrum
Dmitriy Lyubimov
2016-05-02 16:13:22 UTC
Yes, Mahout has stochastic SVD and PCA, which are described at length in the
Samsara book. The book examples in Andrew Palumbo's GitHub also contain an
example of computing a k-means|| sketch.

If you mean _probabilistic_ algorithms: although I have done some things
outside the public domain, nothing has been contributed.

You are very welcome to try something if you don't have big constraints on
OSS contribution.

-d
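For readers unfamiliar with the idea behind stochastic SVD: such methods usually start with a randomized range finder, which multiplies the input by a random Gaussian matrix and orthonormalizes the result. The sketch below is a minimal pure-Python illustration of that step only. It is not Mahout's code; every function name here is hypothetical, and the small rank-2 test matrix is made up for the demonstration.

```python
import random


def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthonormalize_columns(y, rel_tol=1e-8):
    """Modified Gram-Schmidt on the columns of y.

    Columns whose residual is negligible relative to their original norm
    are dropped, so the result has at most as many columns as y.
    """
    basis = []
    for col in transpose(y):
        v = list(col)
        orig_norm = sum(x * x for x in v) ** 0.5
        for q in basis:
            dot = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > rel_tol * orig_norm:
            basis.append([x / norm for x in v])
    return transpose(basis)


def randomized_range(a, k, seed=0):
    """Return Q whose orthonormal columns approximately span the range of a."""
    rng = random.Random(seed)
    n = len(a[0])
    # Random Gaussian test matrix Omega (n x k); Y = A * Omega samples the range of A.
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    return orthonormalize_columns(matmul(a, omega))


def projection_error(a, q):
    """Frobenius norm of A - Q Q^T A."""
    approx = matmul(q, matmul(transpose(q), a))
    return sum(
        (x - y) ** 2 for row_a, row_p in zip(a, approx) for x, y in zip(row_a, row_p)
    ) ** 0.5


# Hypothetical demo: a 6 x 5 matrix of rank 2, built as a product U * V.
rng = random.Random(42)
u = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
v = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(2)]
a = matmul(u, v)

q = randomized_range(a, k=4)
print("basis columns:", len(q[0]), "projection error:", projection_error(a, q))
```

A full stochastic SVD would go on to decompose the much smaller matrix Q^T A; that part is omitted here.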
Khurrum Nasim
2016-05-02 16:25:54 UTC
Hey Dmitriy,

Yes, I meant probabilistic algorithms. If Mahout doesn’t use probabilistic algorithms, then how does it achieve a degree of optimal parallelization? Wouldn’t you need randomization to spread out the processing of tasks?
Dmitriy Lyubimov
2016-05-02 16:39:41 UTC
By probabilistic algorithms I mostly mean inference involving Monte Carlo
type mechanisms (Gibbs sampling LDA, which I think might still be part of
our MR collection, might be an example, as well as its faster counterpart,
variational Bayes inference).

The parallelization strategies are just standard Spark mechanisms (in the
case of Spark), mostly using its standard hash samplers (which in
math speak are really uniform multinomial samplers).
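To make the point about hashing concrete: a hash partitioner assigns each key to a partition deterministically, yet the keys end up spread across partitions much as if they had been sampled uniformly. The sketch below is an illustration only, not Spark's or Mahout's implementation; `hash_partition` is a hypothetical helper, and CRC32 is used here simply because it gives the same result on every run.

```python
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Map a key to a partition index in [0, num_partitions)."""
    # CRC32 of the key's text form: deterministic across runs and processes.
    digest = zlib.crc32(repr(key).encode("utf-8"))
    return digest % num_partitions


# Hypothetical demo: route 10,000 integer keys to 8 partitions and count them.
num_partitions = 8
counts = Counter(hash_partition(k, num_partitions) for k in range(10_000))

for partition in sorted(counts):
    print(partition, counts[partition])
```

No random number generator is involved: the same key always lands in the same partition, which is what lets a shuffle bring matching keys together.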
Khurrum Nasim
2016-05-02 16:47:17 UTC
Thanks for the insight, Dmitriy. I will look further into Spark to understand how it handles parallelization and distributed processing.
Dmitriy Lyubimov
2016-05-03 00:59:58 UTC
Also, Mahout does have an optimizer that simply decides on the degree of
parallelism of the _product_. I.e., if it computes C=A'B, then it figures
that the final result should be split N ways. But it doesn't apply its own
partition function; it just uses the usual hash partitioner to forward
the keys. I don't think we ever override that.
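As a rough picture of what "split N ways" can mean for C=A'B, the sketch below computes the product as a sum of per-row outer products and routes each row of C to one of N buckets by hashing its row index. This is an assumption-laden toy, not Mahout's operator or optimizer: the function names are hypothetical, N is passed in by hand rather than chosen by any optimizer, and Python's built-in `hash` stands in for the hash partitioner.

```python
def atb_partitioned(a, b, num_partitions):
    """Compute C = A'B, storing rows of C in num_partitions hash buckets."""
    n_cols_b = len(b[0])
    partitions = [dict() for _ in range(num_partitions)]
    for a_row, b_row in zip(a, b):
        for j, a_ij in enumerate(a_row):
            # Row j of C receives the partial row a_ij * b_row.
            # The bucket depends only on the key j, never on the data.
            target = partitions[hash(j) % num_partitions]
            acc = target.setdefault(j, [0.0] * n_cols_b)
            for c, b_ic in enumerate(b_row):
                acc[c] += a_ij * b_ic
    return partitions


def collect(partitions):
    """Gather the rows of C from all buckets, ordered by row index."""
    rows = {}
    for part in partitions:
        rows.update(part)
    return [rows[j] for j in sorted(rows)]


# Hypothetical demo with two 3 x 2 inputs, so C = A'B is 2 x 2.
a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0], [0, 1], [1, 1]]
parts = atb_partitioned(a, b, num_partitions=2)
print(collect(parts))
```

The only tunable piece is the number of buckets; the routing rule itself stays the plain hash of the key, which matches the division of labor described above.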
Khurrum Nasim
2016-05-03 14:03:41 UTC
Hi Dmitriy,

Can you please provide a code reference for this in Mahout?

Thanks,
Khurrum Nasim
2016-05-03 14:04:17 UTC
Thank you, Andrew and Dmitriy, for your informed responses.