[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371 ]
Pat Ferrel edited comment on MAHOUT-1853 at 5/26/16 5:04 PM:
-------------------------------------------------------------
Steps:
1) allow an array of absolute LLR value thresholds, one for each matrix pair
2) allow thresholds to be given as a confidence of correlation (actually the confidence with which non-correlation is rejected) or as the fraction of total cross-occurrences retained after downsampling. To reduce how often this must be done, the absolute value thresholds should be output after calculation for later re-use in #1
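To make the "confidence" option in step 2 concrete, here is a minimal sketch of how a confidence level could be turned into an absolute LLR threshold. It assumes the LLR score is the usual G^2 statistic of a 2x2 contingency table, which is approximately chi-squared distributed with one degree of freedom when the two items are uncorrelated. The function name is hypothetical and not part of Mahout.

```python
from statistics import NormalDist


def llr_threshold_for_confidence(confidence: float) -> float:
    """Return the absolute LLR value above which non-correlation is
    rejected with the given confidence.

    Assumption: the LLR score is a G^2 statistic, approximately
    chi-squared with 1 degree of freedom under independence. For 1
    degree of freedom the chi-squared quantile is the square of a
    standard normal quantile, so no external library is needed.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return z * z


# Thresholds for a few confidence levels, rounded for display.
for c in (0.90, 0.95, 0.99):
    print(c, round(llr_threshold_for_confidence(c), 3))
```

The resulting value is an absolute threshold of the kind step 1 takes as input, so it could be recorded and re-used.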
#1 is very easy but not all that useful on its own, since LLR values will vary quite a bit. #1 also retains the O(n) computation complexity. I imagine #1 would be used with #2, since #2 is much more computationally complex and can output thresholds for #1.
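As a rough illustration of #1, the sketch below applies one absolute LLR threshold per matrix pair in a single pass over already-scored cross-occurrences. The data layout (lists of `(row, column, llr)` tuples) and the function names are assumptions made for the example, not the Mahout data structures.

```python
from typing import List, Sequence, Tuple

# One scored cross-occurrence: (row item, column item, LLR score).
Entry = Tuple[str, str, float]


def downsample_by_threshold(entries: Sequence[Entry], threshold: float) -> List[Entry]:
    """Keep only entries whose LLR score meets the absolute threshold.
    One pass over the entries, so linear in their number."""
    return [e for e in entries if e[2] >= threshold]


def downsample_pairs(scored_pairs: Sequence[Sequence[Entry]],
                     thresholds: Sequence[float]) -> List[List[Entry]]:
    """Apply thresholds[i] to scored_pairs[i], e.g. index 0 for A'A
    and index 1 for A'B."""
    if len(scored_pairs) != len(thresholds):
        raise ValueError("need exactly one threshold per matrix pair")
    return [downsample_by_threshold(entries, t)
            for entries, t in zip(scored_pairs, thresholds)]


# Made-up scores: a sparse A'A (items by items) and a dense A'B (items by gender).
aa = [("i1", "i2", 12.0), ("i1", "i3", 0.5)]
ab = [("i1", "male", 2.0), ("i2", "female", 0.1)]
kept_aa, kept_ab = downsample_pairs([aa, ab], [3.84, 1.0])
```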
#2 requires worst-case O(n^2) complexity. Some matrix pairs will have low dimensionality in one direction or both. In fact this low dimensionality is the reason we need a different kind of downsampling for these pairs. Imagine a conversion A'A, which is items by items and may be very large but sparse; then A'B may be products by gender, so it has only 2 columns but is much denser.
The calculation for #2 would, I believe, require performing the un-downsampled A'A, then determining the threshold from the LLR scores, then making another pass to downsample. This will add significant computation time and could make it impractical except for rare re-calculation tasks, in which case the absolute threshold would be recorded and used for subsequent A'A and A'B using #1.
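The two-pass idea can be sketched as follows for the "fraction retained" variant. It assumes all LLR scores of the un-downsampled product are available as a flat list, and it finds the cut-off with a plain sort, which is only one possible way to do it. The names are hypothetical.

```python
import math
from typing import List, Sequence


def threshold_for_fraction(scores: Sequence[float], fraction: float) -> float:
    """Return the absolute LLR threshold that retains roughly `fraction`
    of the scores, keeping the highest ones. Ties at the threshold may
    retain slightly more than requested."""
    if not scores:
        raise ValueError("no scores to derive a threshold from")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[keep - 1]


def two_pass_downsample(scores: Sequence[float], fraction: float):
    """Pass 1: derive the threshold from all scores.
    Pass 2: drop everything below it.
    Returns (threshold, retained scores); the threshold is the value
    that would be recorded for later re-use as an absolute threshold."""
    threshold = threshold_for_fraction(scores, fraction)
    retained: List[float] = [s for s in scores if s >= threshold]
    return threshold, retained


# Made-up LLR scores standing in for an un-downsampled A'A.
threshold, retained = two_pass_downsample(
    [9.0, 1.0, 5.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0], 0.5)
```

The first pass needs every score before any entry can be dropped, which is where the extra computation time comes from.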
Since it is likely to be impractical to calculate #2 very often, it may be better done as an analytics job rather than as part of the A'A job.
For most recommender cases the current downsampling method is fine, but for other uses of the CCO algorithm #2 may be required for occasional threshold re-calculation. In some sense we won't know until we try.
Any comments from [~tdunning] or [~dlyubimov] would be welcome
Post by Andrew Palumbo (JIRA)
Improvements to CCO (Correlated Cross-Occurrence)
-------------------------------------------------
Key: MAHOUT-1853
URL: https://issues.apache.org/jira/browse/MAHOUT-1853
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Andrew Palumbo
Assignee: Pat Ferrel
Fix For: 0.13.0
Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold calculation for LLR downsampling, and possible multiple fixed thresholds for A’A, A’B etc. This is to account for the vast difference in dimensionality between indicator types.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)