Andrew Musselman
2017-03-01 23:27:22 UTC
Hi Shengfa, thanks for reaching out; I'm forwarding to the user and dev
lists so more people can take a look.
We're in the middle of a release this week so responses might be a bit
delayed, but we'll help however we can.
Thanks
---------- Forwarded message ----------
From: Shengfa Lin <***@morningstar.com>
Date: Wed, Mar 1, 2017 at 2:24 PM
Subject: Mahout Compatibility With Hortonworks Sandbox
To: "***@gmail.com" <***@gmail.com>
Hi Andrew,
I am a software developer at Morningstar. I am currently working on a
project to migrate our Mahout pipeline from Cloudera to Hortonworks and
also to use Mahout's built-in Spark functionality.
I found an example that would be really helpful if I could get it working
on my sandbox: classify-20newsgroups.sh with option 3, which runs
complementary naïve Bayes via mahout spark-trainnb.
However, I am getting Exception in thread "main"
java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem
which, after searching on the internet, I think is a classpath issue.
The steps I have taken so far are as follows:
1. Downloaded the Hortonworks sandbox for VirtualBox from
https://hortonworks.com/downloads/#sandbox, which ships with Hadoop 2.7.3,
HDFS, and Spark 1.6.2
(https://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/)
2. Downloaded the Mahout distribution
(apache-mahout-distribution-0.12.2.tar.gz) from
http://archive.apache.org/dist/mahout/0.12.2/
3. After unpacking the Mahout tar in the home directory of the sandbox,
I set up the necessary environment variables:
export MAHOUT_HOME=~/mahout
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export SPARK_HOME=/usr/hdp/current/spark-client
4. Then, as the sandbox-provided user, from
/home/maria_dev/mahout/examples/bin, executed *bash
classify-20newsgroups.sh*, having downloaded and created the data file
manually.
Chose 3. cnaivebayes-Spark
The detailed result was:
…
Running on hadoop, using /usr/hdp/current/hadoop-client/bin/hadoop and
HADOOP_CONF_DIR=
MAHOUT-JOB: /home/maria_dev/mahout/mahout-examples-0.12.2-job.jar
17/03/01 08:44:10 WARN MahoutDriver: No split.props found on classpath,
will use command-line arguments only
17/03/01 08:44:10 INFO AbstractJob: Command line arguments: {--endPhase=
[2147483647], --input=[/tmp/mahout-work-
maria_dev/20news-vectors/tfidf-vectors], --method=[sequential],
--overwrite=null, --randomSelectionPct=[40], --sequenceFiles=null,
--startPhase=[0], --tempDir=[temp], --testOutput=[/tmp/mahout-
work-maria_dev/20news-test-vectors], --trainingOutput=[/tmp/mahout-
work-maria_dev/20news-train-vectors]}
17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/
20news-train-vectors
17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/
20news-test-vectors
17/03/01 08:44:12 INFO SplitInput: part-r-00000 has 162419 lines
17/03/01 08:44:12 INFO SplitInput: part-r-00000 test split size is 64968
based on random selection percentage 40
17/03/01 08:44:12 INFO ZlibFactory: Successfully loaded & initialized
native-zlib library
17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]
17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]
17/03/01 08:44:15 INFO SplitInput: file: part-r-00000, input: 162419 train:
11372, test: 7474 starting at 0
17/03/01 08:44:15 INFO MahoutDriver: Program took 5598 ms (Minutes: 0.0933)
+ '[' xcnaivebayes-Spark == xnaivebayes-MapReduce -o xcnaivebayes-Spark ==
xcnaivebayes-MapReduce ']'
+ '[' xcnaivebayes-Spark == xnaivebayes-Spark -o xcnaivebayes-Spark ==
xcnaivebayes-Spark ']'
+ echo 'Training Naive Bayes model'
Training Naive Bayes model
+ ./bin/mahout spark-trainnb -i /tmp/mahout-work-maria_dev/20news-train-vectors
-o /tmp/mahout-work-maria_dev/spark-model -ow -ma spark://
sandbox.hortonworks.com:7077
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/maria_dev/
mahout/mahout-examples-0.12.2-job.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/maria_dev/
mahout/mahout-mr-0.12.2-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.5.0.0-
1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.
3.2.5.0.0-1245.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/maria_dev/
mahout/lib/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/03/01 08:44:18 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:18 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 INFO SparkContext: Running Spark version 1.6.2
17/03/01 08:44:19 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 INFO SecurityManager: Changing view acls to: maria_dev
17/03/01 08:44:19 INFO SecurityManager: Changing modify acls to: maria_dev
17/03/01 08:44:19 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(maria_dev);
users with modify permissions: Set(maria_dev)
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 INFO Utils: Successfully started service 'sparkDriver' on
port 38386.
17/03/01 08:44:20 INFO Slf4jLogger: Slf4jLogger started
17/03/01 08:44:20 INFO Remoting: Starting remoting
17/03/01 08:44:20 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://***@172.17.0.2:47072]
17/03/01 08:44:20 INFO Utils: Successfully started service
'sparkDriverActorSystem' on port 47072.
17/03/01 08:44:20 INFO SparkEnv: Registering MapOutputTracker
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 INFO SparkEnv: Registering BlockManagerMaster
17/03/01 08:44:20 INFO DiskBlockManager: Created local directory at
/tmp/blockmgr-62b0388f-90a5-407c-bba3-975e4f5e0c81
17/03/01 08:44:20 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
17/03/01 08:44:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT
17/03/01 08:44:21 INFO AbstractConnector: Started
***@0.0.0.0:4040
17/03/01 08:44:21 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
17/03/01 08:44:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://172.17.0.2:4040
17/03/01 08:44:21 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-
7663f921-6ea3-4fa1-999b-bb8662635679
17/03/01 08:44:21 INFO HttpServer: Starting HTTP Server
17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT
17/03/01 08:44:21 INFO AbstractConnector: Started
***@0.0.0.0:33328
17/03/01 08:44:21 INFO Utils: Successfully started service 'HTTP file
server' on port 33328.
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-hdfs-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-hdfs-0.12.2.jar with timestamp
1488357861107
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-math-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-math-0.12.2.jar with timestamp
1488357861112
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-math-scala_2.10-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-math-scala_2.10-0.12.2.jar with
timestamp 1488357861113
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-spark_2.10-0.12.2-dependency-reduced.jar at
http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2-dependency-reduced.jar
with timestamp 1488357861176
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-spark_2.10-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2.jar with timestamp
1488357861177
17/03/01 08:44:21 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:21 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:21 INFO AppClient$ClientEndpoint: Connecting to master
spark://sandbox.hortonworks.com:7077...
17/03/01 08:44:21 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20170301084421-0000
17/03/01 08:44:21 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 47552.
17/03/01 08:44:21 INFO NettyBlockTransferService: Server created on 47552
17/03/01 08:44:21 INFO BlockManagerMaster: Trying to register BlockManager
17/03/01 08:44:21 INFO BlockManagerMasterEndpoint: Registering block
manager 172.17.0.2:47552 with 2.4 GB RAM, BlockManagerId(driver,
172.17.0.2, 47552)
17/03/01 08:44:21 INFO BlockManagerMaster: Registered BlockManager
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: SchedulerBackend is
ready for scheduling beginning after reached minRegisteredResourcesRatio:
0.0
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem
could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:
185)
at java.util.ServiceLoader$LazyIterator.nextService(
ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(
ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(
FileSystem.java:2364)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
FileSystem.java:2375)
at org.apache.hadoop.fs.FileSystem.createFileSystem(
FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(
FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(
FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.mahout.common.Hadoop1HDFSUtil$.delete(
Hadoop1HDFSUtil.scala:76)
at org.apache.mahout.drivers.TrainNBDriver$.process(
TrainNBDriver.scala:98)
at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.
apply(TrainNBDriver.scala:76)
at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.
apply(TrainNBDriver.scala:74)
at scala.Option.map(Option.scala:145)
at org.apache.mahout.drivers.TrainNBDriver$.main(
TrainNBDriver.scala:74)
at org.apache.mahout.drivers.TrainNBDriver.main(
TrainNBDriver.scala)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/
AmazonClientException
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors
(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at java.util.ServiceLoader$LazyIterator.nextService(
ServiceLoader.java:380)
... 19 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.
AmazonClientException
at java.net.URLClassLoader.findClass(URLClassLoader.java:
381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(
Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 24 more
17/03/01 08:44:22 INFO SparkContext: Invoking stop() from shutdown hook
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/api,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs,null}
17/03/01 08:44:22 INFO SparkUI: Stopped Spark web UI at
http://172.17.0.2:4040
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Shutting down all
executors
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
17/03/01 08:44:22 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
17/03/01 08:44:22 INFO MemoryStore: MemoryStore cleared
17/03/01 08:44:22 INFO BlockManager: BlockManager stopped
17/03/01 08:44:22 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/01 08:44:22 INFO OutputCommitCoordinator$
OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
17/03/01 08:44:22 INFO SparkContext: Successfully stopped SparkContext
17/03/01 08:44:22 INFO ShutdownHookManager: Shutdown hook called
17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-
7663f921-6ea3-4fa1-999b-bb8662635679
17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43
Could you please advise on how to run this specific example?
Thanks,
Shengfa
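[Editor's note] The stack trace above bottoms out in a ClassNotFoundException for com.amazonaws.AmazonClientException: Hadoop's ServiceLoader finds the S3AFileSystem provider registered on the classpath but cannot load the AWS SDK classes it depends on. A minimal sketch of one possible workaround, assuming the HDP sandbox ships the hadoop-aws and aws-java-sdk jars under the Hadoop install (the jar locations here are assumptions, not something confirmed in this thread):

```shell
# Hypothetical workaround -- not from the thread. Locate the AWS SDK and
# hadoop-aws jars that HDP may ship (paths are assumptions; check with
# find first) and join them with ':' into a classpath fragment.
AWS_JARS=$(find /usr/hdp/current/hadoop-client \
             -name 'aws-java-sdk*.jar' -o -name 'hadoop-aws*.jar' \
             2>/dev/null | tr '\n' ':')
# Prepend them so the Mahout Spark driver can instantiate the
# S3AFileSystem provider during FileSystem service loading.
export CLASSPATH="${AWS_JARS}${CLASSPATH}"
# Then re-run the failing command from the script:
./bin/mahout spark-trainnb -i /tmp/mahout-work-maria_dev/20news-train-vectors \
  -o /tmp/mahout-work-maria_dev/spark-model -ow \
  -ma spark://sandbox.hortonworks.com:7077
```

Alternatively, if S3 access is not needed at all, removing the S3A provider registration from the classpath would sidestep the ServiceLoader failure, but that depends on which jar registers it in this particular setup.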
lists so more people can take a look.
We're in the middle of a release this week so responses might be a bit
delayed, but we'll help however we can.
Thanks
---------- Forwarded message ----------
From: Shengfa Lin <***@morningstar.com>
Date: Wed, Mar 1, 2017 at 2:24 PM
Subject: Mahout Compatibility With Hortonworks Sandbox
To: "***@gmail.com" <***@gmail.com>
Hi Andrew,
I am a software developer from Morningstar. I am currently working on a
project to migrate our Mahout pipeline from Cloudera to Hortonworks and
also use the built-in spark functionality from Mahout.
I saw there is an example that is going to be really helpful if I could get
the result on my sandbox, classify-20newsgroups.sh with option 3 which is
to run complementary naïve Bayes with mahout spark-trainnb.
However, I am getting Exception in thread "main"
java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem
which after searching on the internet I think itâs a classpath issue.
The steps I have taken so far are as followed,
1. https://hortonworks.com/downloads/#sandbox, downloaded hortonworks
sandbox for virtual box which has Hadoop 2.7.3, Hadoop hdfs and spark 1.6.2
on it (https://hortonworks.com/hadoop-tutorial/learning-the-
ropes-of-the-hortonworks-sandbox/)
2. Downloaded mahout distribution from http://archive.apache.org/
dist/mahout/0.12.2/ (apache-mahout-distribution-0.12.2.tar.gz
<http://archive.apache.org/dist/mahout/0.12.2/apache-mahout-distribution-0.12.2.tar.gz>
)
3. After unpacking the mahout tar in home directory of the sandbox,
then I setup the necessary environment variables
export MAHOUT_HOME=~/mahout
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export SPARK_HOME=/usr/hdp/current/spark-client
4. Then under hortonworks sandbox provided user,
/home/maria_dev/mahout/examples/bin
Executed *bash classify-20newsgroups.sh* by downloading and creating the
data file manually.
Chose 3. cnaivebayes-Spark
And resulted in detail
âŠ
Running on hadoop, using /usr/hdp/current/hadoop-client/bin/hadoop and
HADOOP_CONF_DIR=
MAHOUT-JOB: /home/maria_dev/mahout/mahout-examples-0.12.2-job.jar
17/03/01 08:44:10 WARN MahoutDriver: No split.props found on classpath,
will use command-line arguments only
17/03/01 08:44:10 INFO AbstractJob: Command line arguments: {--endPhase=
[2147483647 <(214)%20748-3647>], --input=[/tmp/mahout-work-
maria_dev/20news-vectors/tfidf-vectors], --method=[sequential],
--overwrite=null, --randomSelectionPct=[40], --sequenceFiles=null,
--startPhase=[0], --tempDir=[temp], --testOutput=[/tmp/mahout-
work-maria_dev/20news-test-vectors], --trainingOutput=[/tmp/mahout-
work-maria_dev/20news-train-vectors]}
17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/
20news-train-vectors
17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/
20news-test-vectors
17/03/01 08:44:12 INFO SplitInput: part-r-00000 has 162419 lines
17/03/01 08:44:12 INFO SplitInput: part-r-00000 test split size is 64968
based on random selection percentage 40
17/03/01 08:44:12 INFO ZlibFactory: Successfully loaded & initialized
native-zlib library
17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]
17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]
17/03/01 08:44:15 INFO SplitInput: file: part-r-00000, input: 162419 train:
11372, test: 7474 starting at 0
17/03/01 08:44:15 INFO MahoutDriver: Program took 5598 ms (Minutes: 0.0933)
+ '[' xcnaivebayes-Spark == xnaivebayes-MapReduce -o xcnaivebayes-Spark ==
xcnaivebayes-MapReduce ']'
+ '[' xcnaivebayes-Spark == xnaivebayes-Spark -o xcnaivebayes-Spark ==
xcnaivebayes-Spark ']'
+ echo 'Training Naive Bayes model'
Training Naive Bayes model
+ ./bin/mahout spark-trainnb -i /tmp/mahout-work-maria_dev/20news-train-vectors
-o /tmp/mahout-work-maria_dev/spark-model -ow -ma spark://
sandbox.hortonworks.com:7077
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/maria_dev/
mahout/mahout-examples-0.12.2-job.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/maria_dev/
mahout/mahout-mr-0.12.2-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.5.0.0-
1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.
3.2.5.0.0-1245.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/maria_dev/
mahout/lib/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/03/01 08:44:18 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:18 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 INFO SparkContext: Running Spark version 1.6.2
17/03/01 08:44:19 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 INFO SecurityManager: Changing view acls to: maria_dev
17/03/01 08:44:19 INFO SecurityManager: Changing modify acls to: maria_dev
17/03/01 08:44:19 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(maria_dev);
users with modify permissions: Set(maria_dev)
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 INFO Utils: Successfully started service 'sparkDriver' on
port 38386.
17/03/01 08:44:20 INFO Slf4jLogger: Slf4jLogger started
17/03/01 08:44:20 INFO Remoting: Starting remoting
17/03/01 08:44:20 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://***@172.17.0.2:47072]
17/03/01 08:44:20 INFO Utils: Successfully started service
'sparkDriverActorSystem' on port 47072.
17/03/01 08:44:20 INFO SparkEnv: Registering MapOutputTracker
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:20 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 INFO SparkEnv: Registering BlockManagerMaster
17/03/01 08:44:20 INFO DiskBlockManager: Created local directory at
/tmp/blockmgr-62b0388f-90a5-407c-bba3-975e4f5e0c81
17/03/01 08:44:20 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
17/03/01 08:44:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT
17/03/01 08:44:21 INFO AbstractConnector: Started
***@0.0.0.0:4040
17/03/01 08:44:21 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
17/03/01 08:44:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://172.17.0.2:4040
17/03/01 08:44:21 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-
7663f921-6ea3-4fa1-999b-bb8662635679
17/03/01 08:44:21 INFO HttpServer: Starting HTTP Server
17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT
17/03/01 08:44:21 INFO AbstractConnector: Started
***@0.0.0.0:33328
17/03/01 08:44:21 INFO Utils: Successfully started service 'HTTP file
server' on port 33328.
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-hdfs-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-hdfs-0.12.2.jar with timestamp
1488357861107
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-math-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-math-0.12.2.jar with timestamp
1488357861112
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-math-scala_2.10-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-math-scala_2.10-0.12.2.jar with
timestamp 1488357861113
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-spark_2.10-0.12.2-dependency-reduced.jar at
http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2-dependency-reduced.jar
with timestamp 1488357861176
17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-spark_2.10-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2.jar with timestamp
1488357861177
17/03/01 08:44:21 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
may be removed in the future. Please use spark.kryoserializer.buffer
instead. The default value for spark.kryoserializer.buffer.mb was
previously specified as '0.064'. Fractional values are no longer accepted.
To specify the equivalent now, one may use '64k'.
17/03/01 08:44:21 WARN SparkConf: The configuration key
'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and
and may be removed in the future. Please use the new key
'spark.kryoserializer.buffer' instead.
17/03/01 08:44:21 INFO AppClient$ClientEndpoint: Connecting to master
spark://sandbox.hortonworks.com:7077...
17/03/01 08:44:21 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20170301084421-0000
17/03/01 08:44:21 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 47552.
17/03/01 08:44:21 INFO NettyBlockTransferService: Server created on 47552
17/03/01 08:44:21 INFO BlockManagerMaster: Trying to register BlockManager
17/03/01 08:44:21 INFO BlockManagerMasterEndpoint: Registering block
manager 172.17.0.2:47552 with 2.4 GB RAM, BlockManagerId(driver,
172.17.0.2, 47552)
17/03/01 08:44:21 INFO BlockManagerMaster: Registered BlockManager
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: SchedulerBackend is
ready for scheduling beginning after reached minRegisteredResourcesRatio:
0.0
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.mahout.common.Hadoop1HDFSUtil$.delete(Hadoop1HDFSUtil.scala:76)
        at org.apache.mahout.drivers.TrainNBDriver$.process(TrainNBDriver.scala:98)
        at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.apply(TrainNBDriver.scala:76)
        at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.apply(TrainNBDriver.scala:74)
        at scala.Option.map(Option.scala:145)
        at org.apache.mahout.drivers.TrainNBDriver$.main(TrainNBDriver.scala:74)
        at org.apache.mahout.drivers.TrainNBDriver.main(TrainNBDriver.scala)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
        at java.lang.Class.getConstructor0(Class.java:3075)
        at java.lang.Class.newInstance(Class.java:412)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
        ... 19 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonClientException
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 24 more
17/03/01 08:44:22 INFO SparkContext: Invoking stop() from shutdown hook
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/api,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs,null}
17/03/01 08:44:22 INFO SparkUI: Stopped Spark web UI at
http://172.17.0.2:4040
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Shutting down all
executors
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
17/03/01 08:44:22 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
17/03/01 08:44:22 INFO MemoryStore: MemoryStore cleared
17/03/01 08:44:22 INFO BlockManager: BlockManager stopped
17/03/01 08:44:22 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/01 08:44:22 INFO OutputCommitCoordinator$
OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
17/03/01 08:44:22 INFO SparkContext: Successfully stopped SparkContext
17/03/01 08:44:22 INFO ShutdownHookManager: Shutdown hook called
17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-
7663f921-6ea3-4fa1-999b-bb8662635679
17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43
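Reading the trace, the immediate failure is a ClassNotFoundException for
com.amazonaws.AmazonClientException: Hadoop's FileSystem ServiceLoader tries to
instantiate the registered S3AFileSystem provider even though the job only
touches HDFS, and the AWS SDK classes that S3A depends on are not on the
classpath. One workaround I am considering (the jar locations and version below
are my guesses about the HDP sandbox layout, not verified) is to find the
hadoop-aws and aws-java-sdk jars and add them to the Hadoop classpath before
launching the script:

```shell
# Locate the jars first; their paths vary between HDP versions (assumption)
find /usr/hdp -name 'hadoop-aws*.jar' -o -name 'aws-java-sdk*.jar'

# Then append them to the classpath the launcher picks up (hypothetical paths)
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/hdp/current/hadoop-client/hadoop-aws.jar"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/hdp/current/hadoop-client/lib/aws-java-sdk-1.7.4.jar"
```

Does that approach make sense, or is there a cleaner way to do this?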
Could you please guide me on how to run this specific example?
Thanks,
Shengfa