[jira] [Created] (MAHOUT-1863) cluster-syntheticcontrol.sh errors out with "Input path does not exist"

Discussion:

Albert Chu (JIRA)

2016-05-26 00:41:12 UTC

Albert Chu created MAHOUT-1863:
----------------------------------

Summary: cluster-syntheticcontrol.sh errors out with "Input path does not exist"
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Priority: Minor

Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'.
${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Albert Chu (JIRA)

2016-05-26 00:43:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert Chu updated MAHOUT-1863:
-------------------------------
Description:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

was:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'.
${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

Post by Albert Chu (JIRA)
cluster-syntheticcontrol.sh errors out with "Input path does not exist"
-----------------------------------------------------------------------
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Priority: Minor
{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch
{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}
one change in question
{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}
now requires that the -p option be specified to -mkdir. This fix is simple.
{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.
Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.
I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.
I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.
So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.
I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.
Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Albert Chu (JIRA)

2016-05-26 00:43:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert Chu updated MAHOUT-1863:
-------------------------------
Description:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. $${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

was:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

Post by Albert Chu (JIRA)
cluster-syntheticcontrol.sh errors out with "Input path does not exist"
-----------------------------------------------------------------------
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Priority: Minor
{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch
{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}
one change in question
{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}
now requires that the -p option be specified to -mkdir. This fix is simple.
{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
the file 'testdata' is hard coded into the example as just 'testdata'. $${WORK_DIR}/testdata needs to be passed in as an option.
Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.
I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.
I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.
So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.
I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.
Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Albert Chu (JIRA)

2016-05-26 00:43:13 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert Chu updated MAHOUT-1863:
-------------------------------
Description:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

was:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. \${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Albert Chu (JIRA)

2016-05-26 00:43:13 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert Chu updated MAHOUT-1863:
-------------------------------
Description:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. \${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

was:
Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}

It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch

{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}

one change in question

{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}

now requires that the -p option be specified to -mkdir. This fix is simple.

Another change:

{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}

appears to break the example b/c in:

examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java

the file 'testdata' is hard coded into the example as just 'testdata'. $${WORK_DIR}/testdata needs to be passed in as an option.

Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.

I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.

I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.

So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.

I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.

Github pull request to be sent shortly.

Post by Albert Chu (JIRA)
cluster-syntheticcontrol.sh errors out with "Input path does not exist"
-----------------------------------------------------------------------
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Priority: Minor
{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch
{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}
one change in question
{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}
now requires that the -p option be specified to -mkdir. This fix is simple.
{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
the file 'testdata' is hard coded into the example as just 'testdata'. \${WORK_DIR}/testdata needs to be passed in as an option.
Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.
I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.
I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.
So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.
I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.
Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Andrew Palumbo (JIRA)

2016-05-26 00:49:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301226#comment-15301226 ]

Andrew Palumbo commented on MAHOUT-1863:
----------------------------------------

Thank you for the bug report, [~chu11]! Please do provide a PR with your fix.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 00:56:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301232#comment-15301232 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

GitHub user chu11 opened a pull request:

https://github.com/apache/mahout/pull/235

MAHOUT-1863: Several fixes to cluster-syntheticcontrol.sh to fix "Input path does not exist" error

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chu11/mahout MAHOUT-1863

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/235.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #235

----
commit 5b4e49be82ff5baaa36f5bf903c1736a170f6d20
Author: Albert Chu <***@llnl.gov>
Date: 2016-05-25T23:48:34Z

Add -p option to hdfs mkdir in cluster_syntheticcontrol.sh

commit e8d90295ee93d40ef0abdd133fea44677eabd5ca
Author: Albert Chu <***@llnl.gov>
Date: 2016-05-26T00:26:02Z

Pass appropriate options job to ensure cluster_syntheticcontrol.sh works by default

----

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Andrew Palumbo (JIRA)

2016-05-26 00:58:14 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301235#comment-15301235 ]

Andrew Palumbo commented on MAHOUT-1863:
----------------------------------------

I agree that there is probably an easier fix than supplying all of the parameters, this script has been around for a while. IIRC there was a recent change made so that users could provide their own working directory by exporting an env var eg {{export WORK_DIR=/hoime/myworkdir}}, the reason being that some os /tmp directories (the original hard coded default bas dir) were not directly accessible to some users directly or large enough to accommodate large files (i cant remember which it was) but it should have defaulted back to {{/tmp}}. The PR may be an easier place to discuss that. Thanks again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 01:02:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301242#comment-15301242 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/235#discussion_r64677969

--- Diff: examples/bin/cluster-syntheticcontrol.sh ---
@@ -75,11 +75,21 @@ if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ]; then
echo "DFS is healthy... "
echo "Uploading Synthetic control data to HDFS"
$DFSRM ${WORK_DIR}/testdata
- $DFS -mkdir ${WORK_DIR}/testdata
+ $DFS -mkdir -p ${WORK_DIR}/testdata
--- End diff --

the problem with `$DFS -mkdir -p` is that it's not backwards compatible with Hadoop 1, which Mahout still technically supports.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 01:07:13 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301245#comment-15301245 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user chu11 commented on a diff in the pull request:

https://github.com/apache/mahout/pull/235#discussion_r64678305

--- Diff: examples/bin/cluster-syntheticcontrol.sh ---
@@ -75,11 +75,21 @@ if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ]; then
echo "DFS is healthy... "
echo "Uploading Synthetic control data to HDFS"
$DFSRM ${WORK_DIR}/testdata
- $DFS -mkdir ${WORK_DIR}/testdata
+ $DFS -mkdir -p ${WORK_DIR}/testdata
--- End diff --

Wasn't aware that was not compatible with Hadoop 1. However, it's already used in some other examples so those presumably won't work with Hadoop 1 either?

```

grep DFS *.sh | grep mkdir

classify-20newsgroups.sh: $DFS -mkdir -p ${WORK_DIR}
classify-20newsgroups.sh: $DFS -mkdir ${WORK_DIR}/20news-all
classify-wikipedia.sh: $DFS -mkdir -p ${WORK_DIR}
cluster-reuters.sh: $DFS -mkdir -p $WORK_DIR
cluster-reuters.sh: $DFS -mkdir -p ${WORK_DIR}/
cluster-reuters.sh: $DFS -mkdir ${WORK_DIR}/reuters-sgm
cluster-reuters.sh: $DFS -mkdir ${WORK_DIR}/reuters-out
cluster-syntheticcontrol.sh: $DFS -mkdir -p ${WORK_DIR}/testdata
```

cluster-syntheticcontrol.sh errors out with "Input path does not exist"
-----------------------------------------------------------------------
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Priority: Minor
{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch
{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}
one change in question
{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}
now requires that the -p option be specified to -mkdir. This fix is simple.
{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.
Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.
I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.
I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.
So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.
I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.
Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 01:14:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301252#comment-15301252 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/235#discussion_r64678831

--- Diff: examples/bin/cluster-syntheticcontrol.sh ---
@@ -75,11 +75,21 @@ if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ]; then
echo "DFS is healthy... "
echo "Uploading Synthetic control data to HDFS"
$DFSRM ${WORK_DIR}/testdata
- $DFS -mkdir ${WORK_DIR}/testdata
+ $DFS -mkdir -p ${WORK_DIR}/testdata
--- End diff --

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 02:02:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301306#comment-15301306 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/235#issuecomment-221758493

The commit that you mention, as well as the `-mkdir -p` introductions looks like it was part of MAHOUT-1794. I've started a dissussion on dev@ about Hadoop 1, So we'll probably see how that ends up before committing a fix for this. Like I said not a deal breaker IMO if the examples are not backwards compatible but we should make an effort.

An easy alternative to `-mkdir -p ${WORK_DIR}/testdata` should just be:
```
$DFS -mkdir /tmp/${WORK_DIR}
$DFS -mkdir /tmp/${WORK_DIR}/testdata
```

That should work with both Hadoop 1 and 2.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 02:52:13 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301344#comment-15301344 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user chu11 commented on the pull request:

https://github.com/apache/mahout/pull/235#issuecomment-221764469

Oh, it ends up MAHOUT-1794 was my patch. I guess those `-mkdir -p`'s were mine :-) The cluster_syntheticcontrol.sh I patched against only created the relative path 'testdata', which is why I never added it in there.

I think your alternate to `-mkdir -p` would only work if WORK_DIR were a directory only one deep long? If it were longer, like `/tmp/foo/bar/` then the `-p` is still needed. Obviously could loop through any path, but that seems like a rat hole no one should go down into.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 16:28:13 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302340#comment-15302340 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/235#issuecomment-221923050

Oh yes- I'd forgotten that we're allowing for user-defined directories now in these scripts so we don't know how long the path will be. So that simple alternative won't work without as you mentioned a loop- these scripts are already complicated enough.. We've discussed tearing them down completely and re-doing them but haven't had a chance (Would you be interested? :))

So I'll have to test this out but I'm for committing this as is. It needs to at least be working on hadoop 2.

We can then look at getting all the scripts back to hadoop 1 compatible later.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 17:19:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302505#comment-15302505 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user chu11 commented on the pull request:

https://github.com/apache/mahout/pull/235#issuecomment-221936067

I certainly don't mind helping out. If anything it's an opportunity to dig into Mahout more :-)

I think some fixes/cleanup to the java files would be good too. Originally I hacked up Kmeans.java to have code to handle a whole bunch of default situations, i.e. things like this.

```
- int maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
+ int maxIterations;
+ if (hasOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION)) {
+ maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
+ }
+ else {
+ maxIterations = 10;
+ }
```

But eventually gave up when I realized:

A) The example required either all of the arguments or none.

B) When handling defaults, a collection of options were required to be passed in by the user. For example `t1` & `t2` are required to be passed in even though in Kmeans you can just specify number of clusters.

C) I didn't want to go down the rat hole of figuring out what to change in DefaultOptionCreator.java and what would be ok w/ the rest of the examples.

And there's cleanup stuff too (e.g. not hard coding defaults like I did above, making some constant somewhere for that).

So is the Mahout team interested in a wide range of "cleanup" kind of patches?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 19:10:13 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302710#comment-15302710 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/235#issuecomment-221965587

We are actually not doing anything in the way of MapReduce anymore. Since Mahout 0.9: MAHOUT-1510, we're doing all of the our new work in what we've called the Mahout "Samsara" environment: http://mahout.apache.org/users/sparkbindings/home.html

aka "Mahout on Spark", "Mahout on Flink", "Mahout on H2O", etc..

Many of the committers responsible for maintaining the MapReduce code are not currently active, making it hard to review patches.

That being said, There may be some interest in clean up of certain MapReduce algorithms maintained by the maintainers that are still around if they you have some obvious bug fixes that you've found. Maybe you could shoot an email to dev@ if you have some in mind?

If you'd like to get more involved in Mahout, you'll find yourself very welcomed! Your time would probably be better spent working on the new framework. We have a good amount of JIRAs started for the next (0.13.0) release and will be adding more, as we just finished up the milestone 0.12.x releases.

Thank you again for the patch!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

ASF GitHub Bot (JIRA)

2016-05-26 23:56:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303203#comment-15303203 ]

ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------

Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/235

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Andrew Palumbo (JIRA)

2016-05-26 23:57:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo resolved MAHOUT-1863.
------------------------------------
Resolution: Fixed
Assignee: Andrew Palumbo
Fix Version/s: 0.13.0

thanks [~chu11]!

Post by Albert Chu (JIRA)
cluster-syntheticcontrol.sh errors out with "Input path does not exist"
-----------------------------------------------------------------------
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Assignee: Andrew Palumbo
Priority: Minor
Fix For: 0.13.0
{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch
{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}
one change in question
{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}
now requires that the -p option be specified to -mkdir. This fix is simple.
{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.
Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.
I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.
I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.
So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.
I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.
Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Hudson (JIRA)

2016-05-27 00:46:12 UTC

Permalink

[ https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303269#comment-15303269 ]

Hudson commented on MAHOUT-1863:
--------------------------------

SUCCESS: Integrated in Mahout-Quality #3359 (See [https://builds.apache.org/job/Mahout-Quality/3359/])
MAHOUT-1863: Several fixes to cluster-syntheticcontrol.sh to fix Input (apalumbo: rev 1f3566d358d94a6e6a868cd74a83a553facea355)
* examples/bin/cluster-syntheticcontrol.sh

Post by Albert Chu (JIRA)
cluster-syntheticcontrol.sh errors out with "Input path does not exist"
-----------------------------------------------------------------------
Key: MAHOUT-1863
URL: https://issues.apache.org/jira/browse/MAHOUT-1863
Project: Mahout
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Albert Chu
Assignee: Andrew Palumbo
Priority: Minor
Fix For: 0.13.0
{noformat}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to patch
{noformat}
commit 23267a0bef064f3351fd879274724bcb02333c4a
{noformat}
one change in question
{noformat}
- $DFS -mkdir testdata
+ $DFS -mkdir ${WORK_DIR}/testdata
{noformat}
now requires that the -p option be specified to -mkdir. This fix is simple.
{noformat}
- $DFS -put ${WORK_DIR}/synthetic_control.data testdata
+ $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
{noformat}
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
the file 'testdata' is hard coded into the example as just 'testdata'. ${WORK_DIR}/testdata needs to be passed in as an option.
Reverting the lines listed above fixes the problem. However, the reverting presumably breaks the original problem listed in MAHOUT-1773.
I originally attempted to fix this by simply passing in the option "--input ${WORK_DIR}/testdata" into the command in the script. However, a number of other options are required if one option is specified.
I considered modifying the above Job.java files to take a minimal number of arguments and set the rest to some default, but that would have also required changes to DefaultOptionCreator.java to make required options non-optional, which I didn't want to go down the path of determining what other examples had requires/non-requires requirements.
So I just passed in every required option into cluster-syntheticcontrol.sh to fix this, using whatever defaults were hard coded into the Job.java files above.
I'm sure there's a better way to do this, and I'm happy to supply a patch, but thought I'd start with this.
Github pull request to be sent shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)