In an article recently posted to the Meta Research website, researchers introduced a contextual-bandit (CB) framework for optimizing real-time bidding mechanisms in internet applications. The framework aimed to enhance user engagement by jointly optimizing bid prices and item rankings in recommendation systems. Using reinforcement learning with deep neural networks, the proposed method achieved significant improvements in user interactions in online experiments on Facebook services.
Background
In modern recommendation systems, offering users a diverse and personalized selection of content is pivotal for improving user experience and preventing content fatigue. Platforms such as Pinterest and Facebook have adopted real-time bidding (RTB) mechanisms to optimize content presentation during auction sessions.
Previous studies in RTB, particularly in advertising and sponsored search domains, have extensively explored utility prediction, bid price optimization, and budget allocation. However, the specific problem of bid price optimization over multiple candidates with the added requirement of ranking has been underexplored.
Traditional approaches often rely on human-designed value formulas with adjustable coefficients, which limits their ability to represent complex environments. While existing RTB research has primarily focused on scenarios where each service bids to present a single content candidate, this paper addressed a more general challenge: deciding bid prices and rankings jointly for multiple content pieces in a slot.
The researchers introduced a framework named bidding and ranking together (BART) that leveraged CB algorithms to learn policies for RTB scenarios. This approach reduced the need for manual parameter tuning and allowed sophisticated policies to be derived from sub-optimal demonstrations. By applying the proposed algorithm to major services in the home feed of Facebook, the authors demonstrated superior performance over hand-tuned baselines in online experiments. The work addressed a gap in the literature by treating bid price optimization and ranking in a more general setting.
Model Formulation
In the context of recommendation systems, the BART problem was formulated as a CB setup. Each user session triggered an auction session in which services bid for the opportunity to present content. Each service selected a set of candidates, and their features, including utility predictions, formed the contextual state. The goal was to learn a policy that assigned a score to each candidate, with the scores determining both the bid price and the rank order.
The bid price function was defined as a weighted sum of sorted scores, proportional to the empirical conversion rate. If the service's bid was the highest, its top-ranked items were shown to the user in order, and the bid price was deducted from the budget. The CB setup comprised states (user and candidate information), actions (candidate scores), and rewards. The reward function combined bid losses, bid costs, and the engagement gained from user interactions with displayed items, so that a single objective drove both bidding and ranking. This gave a unified framework for learning policies in real-time recommendation scenarios.
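The bid price and reward described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `bid_price`, `ranking`, and `session_reward`, the `position_weights` argument (a stand-in for the per-position empirical conversion rates), and the exact reward form with its two coefficients are all assumptions made for illustration.

```python
from typing import List, Sequence


def bid_price(scores: Sequence[float], position_weights: Sequence[float]) -> float:
    """Weighted sum of the scores sorted in descending order.

    position_weights[i] is a hypothetical weight for display position i,
    standing in for the empirical conversion rate mentioned in the article.
    Candidates beyond len(position_weights) do not affect the bid.
    """
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(position_weights, ranked))


def ranking(scores: Sequence[float]) -> List[int]:
    """Candidate indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def session_reward(won: bool, bid: float, engagement: float,
                   bid_loss_penalty: float, engagement_value: float) -> float:
    """Assumed reward shape: a penalty for losing the auction, otherwise
    the value of the observed engagement minus the bid cost."""
    if not won:
        return -bid_loss_penalty
    return engagement_value * engagement - bid
```

In this sketch the same score vector produces both the rank order and the bid, which is the coupling that makes the joint problem different from bidding for a single candidate.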
Policy Optimization
The authors then described the policy optimization process for BART. The policy defined a probability distribution over actions given the state. To address the issue of irrelevant randomness, they proposed a top-K Gaussian policy that focused on the top candidates, which are the ones that contribute to bid prices and user engagement. The batch learning objective was to maximize the expected reward, with parameters optimized through offline training. Using the top-K Gaussian policy in the training objective reduced variance and improved stability.
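One plausible reading of the top-K idea can be sketched as follows. The sketch assumes an independent Gaussian per candidate with a shared standard deviation `sigma`, and a log-probability that sums only over the K highest-scored candidates; the function names and this exact form are assumptions, not the formulation from the paper.

```python
import math
import random
from typing import List, Sequence


def sample_scores(means: Sequence[float], sigma: float,
                  rng: random.Random) -> List[float]:
    """Draw one score per candidate from N(mean, sigma^2)."""
    return [rng.gauss(m, sigma) for m in means]


def top_k_log_prob(means: Sequence[float], scores: Sequence[float],
                   sigma: float, k: int) -> float:
    """Gaussian log-density summed over the k highest-scored candidates.

    Candidates outside the top k are ignored, so noise in their scores
    does not enter the training signal (the assumed source of the
    variance reduction).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(log_norm - (scores[i] - means[i]) ** 2 / (2.0 * sigma ** 2)
               for i in top)
```

With a full Gaussian policy, every candidate's sampled noise would contribute to the log-probability and hence to the gradient, even for items that are never displayed; restricting the sum to the top K removes those terms.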
The researchers also presented a reward-shaping algorithm to determine hyperparameters in the reward function involving the bid loss and engagement reward. The algorithm involved inferring these parameters through a simpler policy tuned in online experiments. This approach simplified the computation compared to traditional inverse reinforcement learning methods, providing accurate reward settings for effective policy optimization in the BART framework.
Experimental Evaluation
The experiments evaluated the BART method on two Facebook home feed services: "Groups you should join" (GYSJ) and "Friend requests" (FR). The two services competed for the same content slot, which displayed the top 20 items to users. In GYSJ, the existing linear formula combined the expected click-through rate (eCTR) and post-click conversion rate (eCVR) to maximize user engagement, while FR aimed to encourage users to accept friend requests based on a probability model. The GYSJ experiment ran for 22 days and the FR experiment for 30 days.
For GYSJ, the BART policy outperformed the hand-tuned value formula, increasing engagement metrics by 0.44%, with a 9.8% rise in impressions and a 14.7% increase in group joins. The BART policy adjusted bid prices more aggressively than the value formula. In FR, the BART model increased accepted friend requests by 7.0%, whereas the logging policy showed an 11.3% drop. Both services experienced improvements in sessions and viewed friend requests.
The BART models were deployed into Facebook production, yielding statistically significant improvements in daily and monthly active users. The backtest results aligned with pretests, confirming the efficacy of BART in enhancing user engagement metrics across the evaluated services. The approach's ability to combine bidding and ranking strategies proved beneficial in different service contexts.
Conclusion
In conclusion, the authors framed the BART problem in a free-market recommendation system as a CB problem. They used top-K Gaussian policies to remove noise in offline stochastic gradients and a lightweight reward-shaping algorithm to set the reward parameters. Their approach, validated in online experiments on Facebook services, significantly improved top-line user engagement metrics. Future work aims to better characterize policy uncertainty and to explore joint optimization across multiple services.