Discussion:
[jira] [Created] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
Andrew Palumbo (JIRA)
2016-05-09 22:03:13 UTC
Permalink
Andrew Palumbo created MAHOUT-1853:
--------------------------------------

Summary: Improvements to CCO (Correlated Cross-Occurrence)
Key: MAHOUT-1853
URL: https://issues.apache.org/jira/browse/MAHOUT-1853
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Andrew Palumbo
Assignee: Pat Ferrel
Fix For: 0.13.0


Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold calculation for LLR downsampling, and possibly multiple fixed thresholds for A’A, A’B, etc. This is to account for the vast difference in dimensionality between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Pat Ferrel (JIRA)
2016-05-26 16:41:13 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

Steps:

1) allow an array of absolute LLR value thresholds, one for each matrix pair
2) allow thresholds to be expressed as a confidence of correlation (more precisely, the confidence that non-correlation is rejected) or as the fraction of total cross-occurrences retained after downsampling. To reduce how often this must be done, the absolute value thresholds should be output after calculation for later re-use in #1.

#1 is very easy but not all that useful since LLR values will vary quite a bit. #1 also retains the O(n) computation complexity. I imagine #1 would be used with #2 since #2 is much more computationally complex and can output thresholds for #1.

#2 requires worst-case O(n^2) complexity. Some matrix pairs will have low dimensionality in one direction or both; in fact, this low dimensionality is the reason we need a different kind of downsampling for these pairs. Imagine a conversion matrix A'A, which is items by items and may be very large but sparse, while A'B may be products by gender: only 2 columns, but much denser.

The calculation for #2 would, I believe, require computing the un-downsampled A'A, determining the threshold from the LLR scores, and then making another pass to downsample. This will add significant computation time and could make it impractical except for rare re-calculation tasks, in which case the absolute threshold would be recorded and used for subsequent A'A and A'B runs via #1.
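The two-pass idea can be sketched as follows (an illustrative Python sketch, not Mahout's actual Scala implementation; `llr`, `quantile_threshold`, and the toy counts are my own names and numbers). The first pass scores each co-occurring pair with Dunning's LLR, a retained-fraction quantile of those scores becomes the absolute threshold, and the second pass downsamples with it:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table of counts."""
    def x_log_x(x):
        return x * math.log(x) if x > 0 else 0.0
    def entropy(*counts):
        return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def quantile_threshold(scores, keep_fraction):
    """Absolute LLR cutoff that retains roughly keep_fraction of the scores."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[k - 1]

# First pass: score every co-occurring pair, e.g. from the un-downsampled A'A.
# Each tuple is the 2x2 table (k11, k12, k21, k22) for one pair -- toy numbers.
tables = {("x", "y"): (10, 0, 0, 10), ("x", "z"): (1, 5, 5, 9)}
scores = {pair: llr(*t) for pair, t in tables.items()}

# Second pass: downsample with the derived absolute threshold. The threshold
# can then be recorded and re-used cheaply in later runs (option #1 above).
threshold = quantile_threshold(scores.values(), keep_fraction=0.5)
kept = {pair for pair, s in scores.items() if s >= threshold}
```

The expensive part is only the scoring of the full, un-downsampled product; once the absolute threshold is known, later runs skip straight to the cheap second pass.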

Since it is likely to be impractical to calculate #2 very often it may be better done as an analytics job rather than part of the A'A job.

For most recommender cases the current downsampling method is fine but for other uses of the CCO algorithm #2 may be required for occasional threshold re-calc. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome.
Pat Ferrel (JIRA)
2016-07-24 17:52:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

To reword this issue...

The CCO analysis code currently employs only a single number-of-values-per-row limit for the P’? matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. For a user * item input matrix, which becomes an item * item output, a fixed number per row is fine, but it is nearly meaningless when the ? matrix has only 20 columns. For instance, if ? = C, category preferences, there may be only 20 possible categories; with a threshold of 100, and given that users often have enough usage to trigger preference events on all categories (though resulting in small LLR values), the P’C matrix is almost completely full. This largely negates the value of P’C.

There are several ways to address this:
1) have a number-of-indicators-per-row threshold for every matrix, not one for all (the current implementation)
2) use a fixed LLR threshold value per matrix
3) use a confidence-of-correlation value (a %, maybe) that is calculated from the data by looking at the distribution in P’C or others. This is potentially O(n^2), where n = number of items in the matrix, but may be practical to calculate for some types of data since n may be very small.

#1 and #2 are easy in the extreme; #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout.
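Options 1 and 2 can be pictured with a tiny sketch (illustrative Python with made-up names, not Mahout's API): per-row top-N capping versus a fixed LLR score threshold, each of which would be configurable per P’? matrix:

```python
def top_n_per_row(row, n):
    """Option 1: cap each row at its n highest-LLR indicators."""
    return dict(sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:n])

def fixed_llr_threshold(row, threshold):
    """Option 2: keep only indicators whose LLR score clears the threshold."""
    return {item: s for item, s in row.items() if s >= threshold}

# One row of a scored cross-occurrence matrix: indicator -> LLR score.
row = {"cat-A": 0.8, "cat-B": 12.4, "cat-C": 3.1}

assert top_n_per_row(row, 1) == {"cat-B": 12.4}
assert fixed_llr_threshold(row, 3.0) == {"cat-B": 12.4, "cat-C": 3.1}
```

With only ~20 category columns, a global top-100 cap (the current single limit) keeps everything; either knob, set per matrix, can actually thin the dense P’C rows.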

starting work on #1 and #2
Pat Ferrel (JIRA)
2016-08-04 16:05:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pat Ferrel updated MAHOUT-1853:
-------------------------------
Sprint: Jan/Feb-2016
Pat Ferrel (JIRA)
2016-08-04 16:16:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ]

Pat Ferrel edited comment on MAHOUT-1853 at 8/4/16 4:15 PM:
------------------------------------------------------------

To reword this issue...

The CCO analysis code currently employs only a single number-of-values-per-row limit for the P’X matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. For a user * item input matrix, which becomes an item * item output, a fixed number per row is fine, but it is nearly meaningless when the X matrix has only 20 columns. For instance, if X = C, category preferences, there may be only 20 possible categories; with a threshold of 100, and given that users often have enough usage to trigger preference events on all categories (though resulting in small LLR values), the P’C matrix is almost completely full. This largely negates the value of P’C.

There are several ways to address this:
1) have a number-of-indicators-per-row threshold for every P'X matrix, not one for all (the current implementation)
2) use a fixed LLR threshold value per matrix
3) use a confidence-of-correlation value (a %, maybe) that is calculated from the data by looking at the distribution in P’C or others. This is potentially O(n^2), where n = number of items in the matrix, but may be practical to calculate for some types of data since n may be very small.

#1 and #2 are easy in the extreme; #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout.

I've started work on #1 and #2

[~ssc][~tdunning] I'm especially looking for comments on #3 above, calculating a % confidence of correlation. The function we use for LLR scoring is https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L210


Ted Dunning (JIRA)
2016-08-04 17:10:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408148#comment-15408148 ]

Ted Dunning commented on MAHOUT-1853:
-------------------------------------

First, I think that the root LLR function would be more appropriate so that you don't have indicators that occur less often than expected.

Regarding the threshold, significance is monotonic in LLR score so thresholding on either is equivalent. The only question is picking the value. Picking based on a significance level has no strong motivation because there is a vast number of repeated and correlated comparisons in play.

As such, I would simply use something like t-digest (available in Mahout as part of the OnlineSummarizer if that has survived, otherwise available as a simple dependency) to aggregate the scores you get in these cases and pick, say, the top 1-10%. The knob should be turned based on how sparse you want the indicators to be on average. If you have the distribution of all the scores available, then picking the cutoff is trivial.

Note that this isn't really n^2. Instead, it is k n = O(n) where k is the number of categories. This is different from the case of text or general viewing behaviors because the vocabulary there is unbounded and grows with n. This means that the computation of the indicators is only O(k n) for the counting and O(k^2) for the cooccurrence counting. If k_max is the interaction cut in some other behavior that has unbounded size, then the cost of the counting is O(k k_max n) for counting and scoring. Both are scalable due to the limitation imposed by the finiteness of k and the artificial limit of the interaction cut.
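The t-digest suggestion amounts to aggregating the distribution of scores and reading the cutoff off a quantile. A real t-digest (the library discussed in this thread) does this approximately in one streaming pass with bounded memory; the sketch below is my own exact, in-memory stand-in, just to show the principle of picking, say, the top 10% of a grossly non-normal score distribution:

```python
import random

random.seed(42)
# Toy scores: skewed and decidedly non-normal, like a tail of LLR values.
scores = [random.expovariate(1.0) for _ in range(10_000)]

def percentile_cutoff(values, keep_top_fraction):
    """Exact in-memory stand-in for reading a quantile off a t-digest."""
    ranked = sorted(values)
    idx = int(len(ranked) * (1.0 - keep_top_fraction))
    return ranked[idx]

cutoff = percentile_cutoff(scores, keep_top_fraction=0.10)
kept = [s for s in scores if s >= cutoff]
assert 0.09 <= len(kept) / len(scores) <= 0.11  # roughly the top 10% survive
```

Once the distribution of the mass of scores is known, the cutoff is trivial to read off, and the knob is simply how sparse the indicators should be on average.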
Pat Ferrel (JIRA)
2016-08-04 18:19:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408256#comment-15408256 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

Is rootLLR normally distributed (the positive half)? If so, we'd have to calculate all rootLLR scores and fit the normal parameters to get the 10% (or other) adaptive threshold, right?

I understand that O(n^2) never occurs in practice. Even for cases where O(k k_max n) is high intuition would say that this threshold could be calculated once and applied for some time since it will tend to stay the same for any specific type of indicator. Calculating it may be a once in a great while operation and the threshold would usually be used in #2 above.

I'm somewhat ignorant of t-digest other than having read your anomaly detection book. I think it's in Mahout, but the docs are here: https://github.com/tdunning/t-digest. I assume that using t-digest would remove the need to do any separate distribution-parameter fitting (as long as we use rootLLR) and could even be applied as online learning, producing an adaptive threshold to feed into #2 above? I imagine it can also be applied periodically to P'X in batch.

No need to respond if I'm on the right track.
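For reference, the signed root-LLR under discussion can be written down directly (my own transcription of the standard definition, not the Mahout source): take the square root of the LLR and attach a negative sign when the pair co-occurs less often than independence predicts:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    def x_log_x(x):
        return x * math.log(x) if x > 0 else 0.0
    def entropy(*counts):
        return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)
    return max(0.0, 2.0 * (entropy(k11 + k12, k21 + k22)
                           + entropy(k11 + k21, k12 + k22)
                           - entropy(k11, k12, k21, k22)))

def root_llr(k11, k12, k21, k22):
    """Signed sqrt of LLR: negative when co-occurrence is below expectation.
    The sign is what lets the score see 'the negative side' of the
    distribution (zero-denominator guards omitted in this sketch)."""
    r = math.sqrt(llr(k11, k12, k21, k22))
    if k11 / (k11 + k12) < k21 / (k21 + k22):
        r = -r
    return r

assert root_llr(10, 0, 0, 10) > 0   # strongly associated pair
assert root_llr(0, 10, 10, 0) < 0   # anti-associated pair, same magnitude
```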
Andrew Palumbo (JIRA)
2016-08-04 18:29:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408275#comment-15408275 ]

Andrew Palumbo commented on MAHOUT-1853:
----------------------------------------

T-digest is still in mahout-math. I believe it is still shipped to the backend in spark-dependency-reduced.jar. It's been a while since we've upgraded the versions, though; not sure if any new versions have been released.
Ted Dunning (JIRA)
2016-08-04 18:37:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408285#comment-15408285 ]

Ted Dunning commented on MAHOUT-1853:
-------------------------------------

[~Andrew_Palumbo] There have been a number of upgrades to t-digest. Faster, more accurate, and very nearly API compatible.

[~pferrel] Yes, root LLR is normally distributed if you have no relationship and have enough data to see the negative side. Most importantly, it is signed.

And yes, the t-digest scan can be pretty rare. Once you know how the mass of data looks, you are good to go.
Pat Ferrel (JIRA)
2016-08-04 18:58:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408326#comment-15408326 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

If t-digest is more tolerant of "not having enough data" than fitting the parameters of a normal distribution, then I'll do #1 and #2 now for 0.13. For #3, I'll then integrate t-digest as a way to calculate the threshold for #2 in the next phase. #3 would be the release after, which would give us time to upgrade t-digest or cut it loose and treat it as a dependency; it's in the Maven repos.
Ted Dunning (JIRA)
2016-08-04 19:48:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408387#comment-15408387 ]

Ted Dunning commented on MAHOUT-1853:
-------------------------------------

[~pferrel] Computing the parameters of a normal distribution is definitely cheaper than updating a t-digest, but I doubt that the difference will be visible. It takes a few additions and divisions to update the mean and sd, while it takes 100-200ns on average to update a t-digest with a new sample.

But the big win happens when the data being collected is grossly non-normal, or when the stuff of interest is an anomalous tail in an otherwise normal distribution. Both of these cases apply in this situation.
Pat Ferrel (JIRA)
2016-08-05 15:49:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409595#comment-15409595 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

Great, that's what I wanted to hear. Normal in principle, but something more tolerant of wonky distributions is worth trying, and in this case we'll avoid doing it every time by saving the threshold for future runs.

Thanks
Pat Ferrel (JIRA)
2016-08-21 00:57:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429550#comment-15429550 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

OK, the first part is implemented. Not sure Ted's suggestion will get into this release, so I'm moving this Jira forward so as not to lose his comments.

Finished the fixed threshold and number of indicators per item for every pair of matrices. So A'A can have an LLR threshold as well as a number per row that is different from A'B's, and so forth.
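The finished behavior amounts to a per-matrix-pair parameter table rather than one global pair of knobs. A hypothetical sketch (these names are mine, not the actual API added in the PR):

```python
# Hypothetical per-pair downsampling config (illustrative names only):
# each cross-occurrence matrix gets its own LLR threshold and row cap.
downsample_params = {
    "A'A": {"llr_threshold": 15.0, "max_per_row": 100},  # sparse item-item
    "A'B": {"llr_threshold": 2.0,  "max_per_row": 2},    # dense, e.g. gender
}

def downsample_row(pair, row, params):
    """Apply that pair's LLR threshold, then cap the row length."""
    p = params[pair]
    kept = {k: v for k, v in row.items() if v >= p["llr_threshold"]}
    top = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[: p["max_per_row"]])

row = {"m": 5.0, "f": 1.5}
assert downsample_row("A'B", row, downsample_params) == {"m": 5.0}
```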
ASF GitHub Bot (JIRA)
2016-08-21 01:15:21 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429556#comment-15429556 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

GitHub user pferrel opened a pull request:

https://github.com/apache/mahout/pull/251

[MAHOUT-1853] Implementing first part



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pferrel/mahout MAHOUT-1853

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #251

----
commit dc6bbc53b143e3db087843ebedc2f463d94d856e
Author: pferrel <***@occamsmachete.com>
Date: 2016-08-21T01:13:00Z

implementing first part of MAHOUT-1853

----
ASF GitHub Bot (JIRA)
2016-08-22 16:14:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431095#comment-15431095 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/251#discussion_r75707999

--- Diff: math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala ---
@@ -211,9 +314,17 @@ object SimilarityAnalysis extends Serializable {

}

- def computeSimilarities(drm: DrmLike[Int], numUsers: Int, maxInterestingItemsPerThing: Int,
- bcastNumInteractionsB: BCast[Vector], bcastNumInteractionsA: BCast[Vector],
- crossCooccurrence: Boolean = true) = {
+ def computeSimilarities(
+ drm: DrmLike[Int],
+ numUsers: Int,
+ maxInterestingItemsPerThing: Int,
+ bcastNumInteractionsB: BCast[Vector],
+ bcastNumInteractionsA: BCast[Vector],
+ crossCooccurrence: Boolean = true,
+ minLLROpt: Option[Double] = None) = {
+
+ val minLLR = minLLROpt.getOrElse(0.0d) // accept all values if not specified
--- End diff --

I think the style convention was to use 0.0 (with a minority split in favor of 0d), but never 0.0d.
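For context on what the minLLR value in this diff is compared against, here is a stdlib-only Python sketch of the standard G² log-likelihood ratio for a 2x2 co-occurrence table. The actual Mahout implementation lives in org.apache.mahout.math.stats.LogLikelihood; treating its score as the entropy-based G² formulation below is an assumption, not a quote of that code:

```python
import math

def llr_score(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency
    table of co-occurrence counts: k11 = both events, k12/k21 = one
    but not the other, k22 = neither. Sketch only; assumed to match
    the score that a minLLR threshold would filter on."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0

    def entropy(*xs):
        # N*log(N) - sum(x*log(x)); the "unnormalized entropy" form
        return xlogx(sum(xs)) - sum(xlogx(x) for x in xs)

    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # clamp tiny negative values from floating-point noise
    return max(0.0, 2.0 * (row + col - mat))
```

Independent counts score near 0, while strongly correlated counts score high, which is why a single absolute cutoff behaves so differently across matrix pairs of different density.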
ASF GitHub Bot (JIRA)
2016-08-22 16:15:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431098#comment-15431098 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/251#discussion_r75708181

--- Diff: spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala ---
@@ -91,13 +93,13 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed

//cross similarity
val matrixCrossCooc = drmCooc(1).checkpoint().collect
- val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
n = (new MatrixOps(m = diff2Matrix)).norm
n should be < 1E-10

}

- test("cooccurrence [A'A], [B'A] double data using LLR") {
+ test("Cross-occurrence [A'A], [B'A] double data using LLR") {
val a = dense(
(100000.0D, 1.0D, 0.0D, 0.0D, 0.0D),
( 0.0D, 0.0D, 10.0D, 1.0D, 0.0D),
--- End diff --

Same note: use either 0.0 or 0d, but not 0.0d.
ASF GitHub Bot (JIRA)
2016-08-22 16:16:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431100#comment-15431100 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/251#discussion_r75708308

--- Diff: spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala ---
@@ -191,14 +193,115 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed

//cross similarity
val matrixCrossCooc = drmCooc(1).checkpoint().collect
- val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtANonSymmetric)
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
n = (new MatrixOps(m = diff2Matrix)).norm

//cooccurrence without LLR is just a A'B
//val inCoreAtB = a.transpose().times(b)
//val bp = 0
}

+ test("Cross-occurrence two IndexedDatasets"){
+ val a = dense(
+ (1, 1, 0, 0, 0),
+ (0, 0, 1, 1, 0),
+ (0, 0, 0, 0, 1),
+ (1, 0, 0, 1, 0))
+
+ val b = dense(
+ (0, 1, 1, 0),
+ (1, 1, 1, 0),
+ (0, 0, 1, 0),
+ (1, 1, 0, 1))
+
+ val users = Seq("u1", "u2", "u3", "u4")
+ val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+ val itemsB = Seq("b1", "b2", "b3", "b4")
+ val userDict = new BiDictionary(users)
+ val itemsADict = new BiDictionary(itemsA)
+ val itemsBDict = new BiDictionary(itemsB)
+
+ // this is downsampled to the top 2 values per row to match the calc
+ val matrixLLRCoocAtBNonSymmetric = dense(
+ (0.0, 1.7260924347106847, 1.7260924347106847, 0.0),
--- End diff --

This is our accepted convention, good.
Hudson (JIRA)
2016-09-13 20:43:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488364#comment-15488364 ]

Hudson commented on MAHOUT-1853:
--------------------------------

FAILURE: Integrated in Jenkins build Mahout-Quality #3393 (See [https://builds.apache.org/job/Mahout-Quality/3393/])
MAHOUT-1853: Add new thresholds and partitioning methods to (pat: rev b5fe4aab22e7867ae057a6cdb1610cfa17555311)
* (delete) CHANGELOG
* (edit) math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
* (edit) spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
Pat Ferrel (JIRA)
2016-10-16 17:20:20 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pat Ferrel resolved MAHOUT-1853.
--------------------------------
Resolution: Fixed
ASF GitHub Bot (JIRA)
2017-02-26 01:45:44 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884489#comment-15884489 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/251

@pferrel is this something that needs to go into 0.13.0?
ASF GitHub Bot (JIRA)
2017-02-26 17:11:46 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884818#comment-15884818 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

Github user pferrel commented on the issue:

https://github.com/apache/mahout/pull/251

It has already been merged. The style comments are valid but not required; I just saw them now (2/26/2017).
ASF GitHub Bot (JIRA)
2017-02-26 17:12:45 UTC
Permalink
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884819#comment-15884819 ]

ASF GitHub Bot commented on MAHOUT-1853:
----------------------------------------

Github user pferrel closed the pull request at:

https://github.com/apache/mahout/pull/251
