[jira] [Created] (MAHOUT-1884) Allow specification of dimensions of a DRM

Discussion:

Sebastian Schelter (JIRA)

2016-10-03 06:51:20 UTC

Sebastian Schelter created MAHOUT-1884:
------------------------------------------

Summary: Allow specification of dimensions of a DRM
Key: MAHOUT-1884
URL: https://issues.apache.org/jira/browse/MAHOUT-1884
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.12.2
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Priority: Minor

Currently, in many cases, a DRM must be read to compute its dimensions when a user calls nrow or ncol. This also implicitly caches the corresponding DRM.

In some cases, the user actually knows the matrix dimensions (e.g., when the matrices are synthetically generated, or when some metadata about them is known). In such cases, the user should be able to specify the dimensions upon creating the DRM and the caching should be avoided.
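
To make the request concrete, here is a minimal toy sketch of the behaviour described above. It is not Mahout's API: `ToyDrm`, `synthetic`, and the `reads` counter are hypothetical names invented for illustration, and a plain Python list stands in for the distributed data. It only shows the difference between computing dimensions lazily (which reads and caches the data) and accepting them from the caller.

```python
from typing import Callable, List, Optional

Row = List[float]


class ToyDrm:
    """Toy stand-in for a distributed row matrix (hypothetical, not Mahout's API)."""

    def __init__(self, load: Callable[[], List[Row]],
                 nrow: Optional[int] = None, ncol: Optional[int] = None):
        self._load = load      # simulates reading the matrix from storage
        self._nrow = nrow      # None means "unknown, compute on demand"
        self._ncol = ncol
        self._cached: Optional[List[Row]] = None
        self.reads = 0         # how many times the data was actually read

    def _rows(self) -> List[Row]:
        # Reading the data also caches it, mirroring the implicit caching
        # side effect described in the issue.
        if self._cached is None:
            self.reads += 1
            self._cached = self._load()
        return self._cached

    @property
    def nrow(self) -> int:
        if self._nrow is None:
            self._nrow = len(self._rows())
        return self._nrow

    @property
    def ncol(self) -> int:
        if self._ncol is None:
            self._ncol = max((len(r) for r in self._rows()), default=0)
        return self._ncol

    @property
    def is_cached(self) -> bool:
        return self._cached is not None


def synthetic() -> List[Row]:
    # A synthetically generated 2 x 3 matrix whose shape is known up front.
    return [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]


unknown = ToyDrm(synthetic)                # dimensions must be computed
known = ToyDrm(synthetic, nrow=2, ncol=3)  # dimensions supplied by the user
```

Asking `unknown` for `nrow` or `ncol` triggers one read and leaves the data cached, while `known` answers both without touching the data at all.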

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Dmitriy Lyubimov

2016-10-03 21:00:05 UTC

This has been covered by the drmWrap() signature from the very beginning.
I vote to treat this as a non-issue.


Dmitriy Lyubimov (JIRA)

2016-10-03 21:12:20 UTC

[ https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543437#comment-15543437 ]

Dmitriy Lyubimov commented on MAHOUT-1884:
------------------------------------------

Which API is this about, specifically?

Wrapping an existing RDD (the drmWrap() API) already supports this.

Also note that for DRMs read off disk, these are one-pass computations that cost no more than RDD$count(). For any dataset we call dfsRead() on, the obvious intent is to use it, so loading and caching do no harm: that is what would happen anyway.

Also, matrix dimensions are the most obvious properties, but they are not everything the optimizer may need to analyze about the dataset (lazily). There are more heuristics about datasets that drmWrap() accepts (and even more that it doesn't).

If we are talking about cases where drmWrap() cannot be used for some reason, we should probably request metadata equivalent to what drmWrap() accepts, not just ncol and nrow.
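
As a rough illustration of that last point, the sketch below bundles dimensions together with other dataset hints in one metadata object instead of passing ncol and nrow alone. Everything in it is an assumption made for illustration: `DrmMetadata`, `cache_hint`, `can_have_missing_rows`, `missing`, and `needs_scan` are hypothetical names, not a listing of what drmWrap() actually takes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrmMetadata:
    """Hypothetical bundle of caller-supplied facts about a matrix."""
    nrow: Optional[int] = None           # None means "not known up front"
    ncol: Optional[int] = None
    cache_hint: str = "NONE"             # illustrative extra hint
    can_have_missing_rows: bool = False  # illustrative extra hint

    def missing(self) -> List[str]:
        # Names of the dimensions the caller did not supply.
        return [name for name in ("nrow", "ncol") if getattr(self, name) is None]

    def needs_scan(self) -> bool:
        # A pass over the data is only needed when a dimension is unknown.
        return bool(self.missing())


full = DrmMetadata(nrow=1000, ncol=50)
partial = DrmMetadata(nrow=1000)
```

With `full`, nothing has to be computed from the data; with `partial`, `missing()` reports that only ncol would still require a pass over it.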
