Discussion:
Comparing scikit-learn, Mahout Samsara and SystemML
Trevor Grant
2017-06-06 20:41:16 UTC
Hey Gustavo, et al.

First off- great topic and thank you for moving it here!

Secondly, Matthias- awesome response- helped me too.

Looping in Mahout-dev, as I think this is super productive (and looking
forward to having an archived thread to point to).

Hoping D or S (and others) jump in on this too- but I can quickly speak to
a couple of the things from the Mahout side:

SystemML and Mahout (and others) recognize that 1) MLLib/SparkML have major
shortcomings as distributed ML libraries, mainly that they aren't
extensible, and 2) if you're going to 'roll your own algorithms', it would
be best to have a mathematically expressive way to do that (a programming
language that makes it easy to follow the math, similar to R).

Your two primary criteria:
Both have GPU support. On the Mahout side, recordings of a couple of talks
from NVidia's GTC conference that explain this in more detail will be
publicly available tomorrow; we'll blast them on Twitter, or just reach out
and I'll share the link.

Both scale well on Apache Spark

To your secondary criteria, I would say fairly matched on all except:
c - quality of tools for development - advantage Mahout. This stems from
Mahout being Scala based, and therefore being able to leverage all popular
IDEs with Scala support out of the box (code completion, Scaladocs, etc.),
in addition to using other Scala libraries - for instance, preprocessing
images with scrimage or other SparkML/MLLib utilities - and the pipeline is
all one set of code.

f - quality of documentation - advantage SystemML. We're in the middle of a
website reboot and actively seeking to close this gap, but it is a weak
spot for Mahout right now.

Additionally, I would point out that Mahout, because of its Scala-based DSL,
will integrate into other programs more easily, from a code perspective.
The counterpoint: SystemML has much better support for exporting models as
PMML, which, in a microservices architecture, makes SystemML better for
deploying its models (again, we have an open JIRA for PMML support, but at
the moment SystemML wins).

Finally, I would point out that Mahout is built to be engine neutral.
SystemML, by contrast, can do certain distributed optimizations because it
KNOWS it will be running on Spark. Mahout, on the other hand, was built so
that you can change your distributed engine with no modification to the
algorithm (only the bindings). To write new bindings, one simply defines
the distributed structure of the Distributed Row Matrix (DRM) and then
defines certain operations (like A %*% B and A.t %*% A) on those
distributed matrices; the point being, it's much easier than porting code
to a new engine. The key here is that if/when Spark falls out of favor,
Mahout is going to be the first on the scene with a robust and powerful
machine learning library; or, if you switch engines internally, you'll find
porting your machine learning much easier with Mahout. This 'feature' seems
less obvious to the 'getting started' user, but is fairly important for the
user with an eye to the long game. Succinctly, the trade-off is
optimization now vs. future-proofing your code. The value of this depends a
lot on everyone's personal forecast for the fate of the Apache Spark
project ;)

(This neutrality also supports interesting scenarios like hybrid
Spark-batch/Flink-streaming use cases.)
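
To make the neutrality point concrete, here's a rough sketch of what
engine-neutral algorithm code looks like (imports are from the 0.13-era API
as I remember it, so treat the exact names as assumptions). The function
only sees the DRM abstraction; whichever bindings are in scope (Spark,
Flink, ...) supply the physical operators:

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// The A.t %*% A pattern mentioned above, written without naming any engine.
// The engine bindings decide how this Gramian is physically computed.
def gramian(drmA: DrmLike[Int]): Matrix =
  (drmA.t %*% drmA).collect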

The original post had also asked something about Python vs. Scala. I'm
going to take the liberty of chiming in on that too. I think Python is an
excellent language, and worth knowing. I think sklearn is a great ML
package for doing a first stab / playing with the data / prototyping. I
think certain paradigms of Scala make it much better suited to working in
distributed settings (the way you must express jobs forces your brain to
think in terms of mapping and reducing). Even though there are claims here
and there that various Python frameworks are good for distributed ML
(PySpark and others), none has ever really impressed me; imho, manually
distributing sklearn would be a more rewarding experience than using any of
them.

My .02, and thanks again!

trevor

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Thanks for reaching out, Gustavo. An objective discussion of how exactly
SystemML and Mahout Samsara compare will probably help other people too. In
order to remove bias, I'm cc'ing Dmitriy and Sebastian from the Samsara
team, so they can correct me if needed. Scikit-learn is a great and very
popular library of algorithms (which nicely integrates with NumPy), but I'm
excluding it here because it does not focus on large-scale ML.
Fundamentally, SystemML and Mahout Samsara have very different histories
and represent different points in the design space for custom large-scale
machine learning (ML). Mahout started as a library of algorithms on Hadoop
MapReduce and, as an overall project, is certainly more mature and has a
larger community. Samsara itself is a more recent extension for custom
large-scale ML on Spark and Flink. In contrast, SystemML was built from
scratch for custom large-scale ML, originally on MapReduce and later Spark.
After SystemML's initial open source release in 2015, it became a top-level
Apache project just two weeks ago, and we're actively working on growing
our community.
From a technical perspective, SystemML follows a compiler approach where
scripts with R- or Python-like syntax (but only syntax) are automatically
compiled to hybrid runtime plans, composed of in-memory, single-node
operations and operations on MapReduce or Spark. At script level, users
work with matrices, frames, and scalars without specifying physical data
properties such as dense/sparse representations, local/distributed storage,
partitioning, or caching. The major advantages are (1) the ability to
easily write custom large-scale ML algorithms, (2) automatic adaptation to
different data characteristics (distributed operations are compiled only if
needed), and (3) simplified deployment (because the same script can be used
for large-scale or local computations).
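To make that concrete, here is a rough sketch of a DML script invoked from
Spark through the MLContext API (package and class names are from the
SystemML releases I've used; treat the exact signatures as my assumptions,
not gospel). The script itself is plain linear algebra; SystemML decides
per operation whether to run it in-memory on the driver or as distributed
Spark jobs:

import org.apache.spark.sql.SparkSession
import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dml

object DmlDemo extends App {
  // Local Spark session for illustration; on a cluster the same script is reused unchanged.
  val spark = SparkSession.builder().appName("dml-demo").master("local[*]").getOrCreate()
  val ml = new MLContext(spark)

  val script = dml(
    """
      |X = rand(rows = 10000, cols = 100)     # synthetic features
      |y = rand(rows = 10000, cols = 1)       # synthetic labels
      |beta = solve(t(X) %*% X, t(X) %*% y)   # least squares via the normal equations
      |err = sum((X %*% beta - y) ^ 2)
      |print("squared error: " + err)
      |""".stripMargin)

  ml.execute(script)
  spark.stop()
}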
In contrast, Samsara is a domain-specific language (DSL), embedded in the
host language Scala. Users can either use local matrices or so-called
Distributed Row Matrices (DRM) for distributed computation. Operations over
local matrices are executed as is, without further optimization. In
contrast, operations over DRMs are collected into a DAG of operations and
lazily optimized and executed on triggering actions such as full
aggregations, write, or explicit collect into a local matrix. Hence, the
user is in charge of deciding between local and distributed operations,
caching, and other data flow properties. At the same time, this lower-level
specification allows for more control and the ability to escape to explicit
distributed operations over rows of the DRM if needed.
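As a tiny, hedged Samsara sketch of that behaviour (imports and the context
helper are from the Mahout 0.13-era Spark bindings as I remember them, so
treat them as assumptions): the distributed expressions below only build a
logical DAG until an action such as collect forces optimization and
execution.

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object SamsaraDemo extends App {
  // The implicit distributed context picks the engine; here Spark in local mode.
  implicit val ctx = mahoutSparkContext(masterUrl = "local[*]", appName = "samsara-demo")

  // A small in-core matrix, promoted to a Distributed Row Matrix (DRM).
  val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
  val drmA = drmParallelize(inCoreA, numPartitions = 2)

  // This only records a logical DAG of operators; nothing runs yet.
  val drmAtA = drmA.t %*% drmA

  // collect triggers optimization, executes the plan on Spark, and returns an in-core matrix.
  val inCoreAtA = drmAtA.collect
  println(inCoreAtA)
}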
At compiler and runtime level, there are a number of similarities but also
major differences. For example, both systems provide different physical
operators (for instance, for matrix multiplication), chosen depending on
operation patterns as well as data and cluster characteristics. This
includes local operators, operators for special patterns like t(X)%*%X,
broadcast-based, co-partitioning, and shuffle-based operators.
Additionally, SystemML uses a variety of simplification rewrites, a
different distributed matrix representation based on binary block matrices
(with various dense, sparse, and ultra-sparse formats), and fused operators
in order to reduce scans and intermediates and to exploit sparsity across
chains of operators. Regarding GPUs, we recently added a GPU backend for
deep learning and generally compute-intensive operations as an experimental
feature in SystemML, and we're actively working on making it
production-ready. I heard that Mahout is similarly working on GPU support,
but I am not sure about the details.
To summarize, both SystemML and Samsara aim at different abstraction
levels, and differ substantially in their compiler and runtime internals.
Of course, there are also shared goals and motivations (such as simplifying
custom, large-scale ML), but competition is good as it drives improvements.
I hope this gives a high-level comparison. If you have additional specific
questions, feel free to ask.
Regards,
Matthias
On Mon, Jun 5, 2017 at 6:56 PM, Gustavo Frederico wrote:
Greetings,
I worked with the theory of SVMs during my graduate studies and I'm
relatively new to existing ML software. Assuming that I want to create new
scalable ML algorithms starting with the math, the question is: how do
scikit-learn, Mahout Samsara and SystemML compare to each other?
I see interesting Python-based frameworks such as scikit-learn, but then I
read SystemML's article on Wikipedia, which made me question the
distributive aspect:
"[...] It was observed that data scientists would write machine learning
algorithms in languages such as R and Python for small data. When it came
time to scale to big data, a systems programmer would be needed to scale
the algorithm in a language such as Scala. This process typically
involved
days or weeks per iteration, and errors would occur translating the
algorithms to operate on big data. " ( https://en.wikipedia.org/wiki/
Apache_SystemML )
And the article starts by stating that Apache SystemML has "algorithm
customizability via [...] Python-like languages".
Mahout Samsara is based on Scala. PredictionIO
(predictionio.incubator.apache.org) algorithms are based on Mahout Samsara
and Scala. I asked Mr. Matthias Boehm at a conference how one could compare
Mahout Samsara to SystemML. From what I understood, Samsara needs "explicit
declarations" in expressions for distributed computing, while SystemML
doesn't (please correct me if I'm wrong). Also, SystemML will optimize the
entire script, while Samsara will optimize expressions (again, please
correct me if I'm wrong).
While my main criterion is scalability (cluster, GPU support, etc.), other
criteria to evaluate these frameworks may be: a) public adoption, b) active
dev community, c) quality of tools for development, d) backing of big
companies, e) simplicity working with clusters (delegating the complexities
of clustering to the framework, "hiding" them from the user), f) quality of
documentation, g) quality of the software itself.
(My question was deleted from stats.stackexchange.com for being off-topic
and deleted from Stack Overflow for being bound to get answers with
"opinions rather than facts" [sic]. I'm very much interested in hearing
balanced and insightful comments from the list.)
Thank you,
Gustavo