Discussion:
Recommenders and MABs
Pat Ferrel
2016-09-17 22:10:41 UTC
I’ve been thinking about how one would implement an application that only shows recommendations. This is partly because people want to build such things.

There are many problems with this, including cold start and overfit. However, these same problems face MABs and are solved there with sampling schemes. So imagine that you have several models from which to draw recommendations: 1) a CF-based recommender, 2) random recommendations, 3) popular recs (by some measure). We can treat each individual user as facing an MAB, with a sampling algorithm trained on that user's feedback to pull recs from the 3 (or more) arms. This implies an MAB per user.
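For concreteness, here is a minimal sketch of the per-user bandit in Python, assuming Beta-Bernoulli Thompson sampling and a binary reward (the user clicked a served rec or didn't); the arm names and the reward signal are placeholder assumptions, not a fixed design:

    import random

    ARMS = ["cf", "popular", "random"]   # one arm per model

    class UserBandit:
        """One Beta-Bernoulli Thompson-sampling bandit per user."""

        def __init__(self):
            # Beta(1, 1) priors: uniform, i.e. no opinion about any arm yet
            self.alpha = {arm: 1.0 for arm in ARMS}  # successes + 1
            self.beta = {arm: 1.0 for arm in ARMS}   # failures + 1

        def choose_arm(self):
            # Sample a plausible click rate per arm; serve from the winner
            draws = {arm: random.betavariate(self.alpha[arm], self.beta[arm])
                     for arm in ARMS}
            return max(draws, key=draws.get)

        def record(self, arm, clicked):
            # Binary reinforcement for the arm that produced the recs
            if clicked:
                self.alpha[arm] += 1.0
            else:
                self.beta[arm] += 1.0

On the first visit all three posteriors are identical, so the draw is effectively uniform, which matches the cold-start behavior described next.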

On the very first visit to the application the bandit would draw randomly from the arms. Since there is no user data yet, the recs engine would still have to respond (perhaps with random recs); the same would be true of the popular model (returning random until popularity can be computed), and the random arm is always happy. The complication is that none of the arms are completely independent, and the model driving each arm will change over time.
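A sketch of the serving side under that assumption; cf_model and popularity are hypothetical stand-ins for whatever backs the arms, and the only real point is that any arm with nothing to say degrades to random:

    import random

    def recs_for_arm(arm, user_id, n, catalog, cf_model, popularity):
        """Serve n item ids for the arm the bandit picked; fall back to
        random whenever the backing model has nothing for this user."""
        if arm == "cf" and cf_model is not None:
            recs = cf_model.recommend(user_id, n)  # assumed interface
            if recs:
                return recs
        if arm == "popular" and popularity:
            return list(popularity[:n])            # precomputed ranking
        # the random arm, and the fallback for every other arm
        return random.sample(catalog, n)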

A user's first visit creates a new MAB for them, which draws randomly from all arms but may get better responses from popular (since there is no user-specific data in the system yet for CF). So the sampling will start to favor popular while still exploring the other arms. Once enough data has accumulated for CF to make good recs, the recommender will start to outperform popular and will earn more of the user's reinforcement.
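A toy loop (reusing UserBandit from the sketch above) shows that drift: with invented click rates where cf starts weak and improves, the bandit's pulls shift from popular toward cf as evidence accumulates:

    # Invented click rates, purely to exercise the bandit above
    bandit = UserBandit()
    pulls = {arm: 0 for arm in ARMS}
    for visit in range(2000):
        arm = bandit.choose_arm()
        pulls[arm] += 1
        rate = {"random": 0.02,
                "popular": 0.08,
                # pretend cf improves as interaction data accrues
                "cf": min(0.15, 0.01 + visit * 0.0002)}[arm]
        bandit.record(arm, random.random() < rate)
    print(pulls)  # popular dominates early pulls, cf wins out later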

This seems workable, but it leaves several unanswered questions and one problem to avoid: overfit. We would need a sampling method that never fully converges, or the user would never get a chance to show their expanding/changing preferences. The CF recommender will also overfit if non-CF items are not mixed in. Of the sampling methods I've seen for MABs, greedy will not work, but even with some form of Bayesian/Thompson sampling the question is how to parameterize it. With too little convergence we get sub-optimal exploitation, but too much convergence gives the same result and will also overfit the CF recs.
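One concrete way to keep the sampler from ever fully converging, sketched against the Beta-Bernoulli setup above: decay old evidence and cap the pseudo-counts, so every posterior stays wide enough to keep exploring. GAMMA and CAP here are exactly the parameterization in question, and the values are placeholders:

    GAMMA = 0.995  # per-update evidence decay (placeholder value)
    CAP = 200.0    # max pseudo-count per arm; bounds posterior sharpness

    def record_nonconverging(bandit, arm, clicked):
        """Like UserBandit.record, but old evidence decays toward the
        Beta(1, 1) prior and counts are capped, so no arm's posterior
        ever gets sharp enough to shut exploration off."""
        for a in ARMS:
            bandit.alpha[a] = 1.0 + GAMMA * (bandit.alpha[a] - 1.0)
            bandit.beta[a] = 1.0 + GAMMA * (bandit.beta[a] - 1.0)
        if clicked:
            bandit.alpha[arm] += 1.0
        else:
            bandit.beta[arm] += 1.0
        total = bandit.alpha[arm] + bandit.beta[arm]
        if total > CAP:  # rescale, preserving the success/failure ratio
            bandit.alpha[arm] *= CAP / total
            bandit.beta[arm] *= CAP / total

The decay also speaks to the earlier point that the model behind each arm changes over time: stale evidence about an arm fades instead of pinning its posterior.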

I imagine we could train a meta-model on the mature explore amount by trying different parameterizations and finding whether there is one answer for all users, or we could resort to heuristic rules, even business rules.

If anyone has read this far, any ideas or comments?
Dmitriy Lyubimov
2016-09-21 22:07:46 UTC
There's been a great blog post on that somewhere on the richrelevance blog [1]... but I have a vague feeling, based on what you are saying, that it may all be old news to you...

[1] http://engineering.richrelevance.com/bandits-recommendation-systems/
and there's more in the series