Sophia Liu

Microsoft

Data Scientist

Upcoming A2M Course

Online Controlled Experimentation at Scale

Data Science

2019-05-18 10:00--11:00

ABSTRACT
Online controlled experimentation (A/B testing) has proven to be one of the most effective ways to improve products and drive revenue through data science. In this proposal, we summarize lessons learned from running thousands of experiments across multiple products, with real examples from Microsoft. We then present challenges and best practices for executing large-scale online controlled experiments.

AUDIENCE
This is an intermediate-level technical talk on digital randomized controlled experiments, i.e., A/B testing. The audience is expected to have basic knowledge of probability, statistics, and hypothesis testing. We will cover basic concepts, but participants are assumed to have some prior knowledge. Using real example experiments from Microsoft, we will present challenges, insights, and lessons learned through several interactive games and exercises.

Approach / Key Points for Success:
What is experimentation, i.e., A/B tests?
Experimentation, i.e., A/B testing, is a method for comparing two behaviors, commonly the existing behavior and a new one, against each other to determine which performs better. Two or more variants are shown to users at random, and statistical analysis is used to determine which variant performs better against the business objectives of the product or organization. Through randomization, all other factors that could affect the outcome are balanced between the control and treatment groups. Thus, using data collected from users, statistical tests can establish a causal link between the change in behavior and changes in key metrics.
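To make this concrete, here is a minimal sketch of the core comparison, assuming a simple per-user conversion metric and synthetic data; the sample sizes, rates, and names are illustrative only, not from any real experiment.

# Minimal A/B comparison sketch with synthetic data (all numbers illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user outcomes: 1 = converted, 0 = did not convert.
control = rng.binomial(1, 0.100, size=50_000)    # existing behavior (A)
treatment = rng.binomial(1, 0.103, size=50_000)  # new behavior (B)

# Two-sample (Welch) t-test on the difference in conversion rates.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
lift = treatment.mean() - control.mean()
print(f"observed lift: {lift:.4f}, p-value: {p_value:.4f}")
# A small p-value means the observed difference is unlikely under the null
# hypothesis of no effect; because assignment was randomized, the difference
# can be read as a causal effect of the new behavior.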

Outcomes:
Challenges and Best Practices in large-scale online experimentation
Because client-side experimentation requires code to be shipped to customers and data to be collected from and sent by client devices, it faces several distinct challenges. In this session, we will discuss the common limitations and challenges of client-side experimentation, along with suggestions and guidelines for best practices.
1. Shipping cadence
Client-side experimentation requires different behaviors to be shipped to customers. This imposes two major challenges. First, experimentation needs to coincide with the shipping cadence. Because client software and operating systems need extensive testing, such updates cannot happen as regularly as for server-side products (e.g., websites), which can quickly revert in the case of failures. Second, the rollout of client updates is neither instantaneous nor complete (i.e., some users will never update). This limits the agility of client-side experimentation. Silent updates, or increasing the shipping cadence through controlled rollouts, can help teams iterate faster and more efficiently.
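To illustrate the controlled-rollout idea, below is a minimal sketch of a staged exposure ramp gated on guardrail health; the stage percentages and the guardrails_healthy signal are assumptions made for illustration, not a description of any specific rollout tooling.

# Staged (controlled) rollout sketch: exposure only advances while guardrails are healthy.
RAMP_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]  # fraction of users exposed at each stage

def next_exposure(current: float, guardrails_healthy: bool) -> float:
    """Advance to the next stage if guardrail metrics look healthy;
    otherwise drop exposure to zero, i.e., withdraw the change."""
    if not guardrails_healthy:
        return 0.0
    for stage in RAMP_STAGES:
        if stage > current:
            return stage
    return current  # already fully rolled out

print(next_exposure(0.05, guardrails_healthy=True))   # -> 0.2
print(next_exposure(0.20, guardrails_healthy=False))  # -> 0.0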
2. Feature coding and rollback
Client-side experimentation starts with the creation of one or more variant code paths, where each variant is an adjustment to the behavior of the system. Coding such features efficiently during the product design phase and collecting users' variant assignments requires a centralized tool or platform. Moreover, if the feature code introduces bugs or failures during the experiment, a centralized configuration platform needs to be in place to stop the experiment and withdraw those features from the client.
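To illustrate, here is a minimal sketch of configuration-driven variant assignment with a kill switch; the experiment name, configuration shape, and hashing scheme are hypothetical, not a real platform API.

# Config-driven variant assignment sketch (experiment name and config are hypothetical).
import hashlib

EXPERIMENT_CONFIG = {
    "new_start_menu": {"enabled": True, "treatment_fraction": 0.5},
}

def assign_variant(user_id: str, experiment: str) -> str:
    cfg = EXPERIMENT_CONFIG.get(experiment)
    if not cfg or not cfg["enabled"]:
        return "control"  # kill switch: disabling the config withdraws the feature
    # Deterministic hash so the same user always lands in the same variant.
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 10_000
    return "treatment" if bucket < cfg["treatment_fraction"] * 10_000 else "control"

print(assign_variant("user-123", "new_start_menu"))

The assignment itself would also be logged as telemetry so that the analysis can join each user's outcomes to the variant they actually received.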
3. Trustworthy data collection
Because users' telemetry (e.g., usage of the app) needs to be sent back to the server, a network connection is required, so user data can be delayed or even lost. Furthermore, client timestamps can be unreliable. These data issues can dilute and sometimes invalidate the analysis results. One recommendation is to refine the data-cooking pipeline, add data-quality metrics, and continuously monitor data-collection quality through a feedback system.
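As one example of a data-quality metric, the sketch below checks for a sample ratio mismatch (SRM) between the observed and configured traffic split; the counts are illustrative, and the chi-square goodness-of-fit test here is a common choice for this check rather than a prescribed pipeline step.

# Sample ratio mismatch (SRM) check sketch (counts are illustrative).
from scipy import stats

observed = [50_400, 49_100]    # users observed in control and treatment telemetry
expected_ratio = [0.5, 0.5]    # configured 50/50 split
total = sum(observed)
expected = [r * total for r in expected_ratio]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"possible SRM (p={p_value:.2e}): investigate telemetry loss before trusting results")
else:
    print("observed traffic split is consistent with the configuration")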
4. Metrics monitoring and alerting
Experimenting on thousands of features can be overwhelming, and it does not scale to manually check metric movements for every experiment. This is particularly important when failures occur and we need to roll the features back from the client. Having an alerting system in place is key to solving these issues. In addition, we highly recommend that experimenters maintain a rich set of metrics that surfaces problems and blind spots as well as successes. Three layers of metrics are suggested as guidelines for metric design (a minimal encoding of these layers is sketched after the list below).
(a) Guardrail metrics. These are a set of organizational metrics that should not regress during the experiment. Example metrics include time to load a page for Bing and system resource usage for Windows.
(b) Diagnostic metrics. These are metrics that signify the impact a local feature has on the product. Example metrics are funnel metrics for a shopping website.
(c) Key metrics contributing to the Overall Evaluation Criteria (OEC). The OEC metrics are the ones an organization regards as its key criteria for success. It sometimes happens that some teams optimize for retention while others optimize for revenue; the leadership team should then unify these at the organization level and provide guidance on the OEC design, making sure that teams are pulling in the same direction.
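As referenced above, here is a minimal sketch of how a metric catalog could encode these three layers and drive simple alerting; the metric names, thresholds, and sign convention are illustrative assumptions.

# Metric catalog sketch: three layers plus a simple alerting rule (names and thresholds illustrative).
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    layer: str              # "guardrail", "diagnostic", or "oec"
    alert_threshold: float  # relative regression beyond which we alert

METRICS = [
    Metric("page_load_time", layer="guardrail", alert_threshold=0.01),
    Metric("checkout_funnel_step2_rate", layer="diagnostic", alert_threshold=0.05),
    Metric("sessions_per_user", layer="oec", alert_threshold=0.02),
]

def needs_alert(metric: Metric, relative_change: float) -> bool:
    # Simplifying assumption: metrics are normalized so that a negative
    # relative_change always means the metric moved in the bad direction.
    return relative_change < -metric.alert_threshold

print([m.name for m in METRICS if needs_alert(m, relative_change=-0.03)])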
5. Trustworthy data analysis
Running statistical tests and computing variances correctly is critical for accurate interpretation of experimental results. Data issues such as insufficient sample sizes, highly skewed distributions, and outliers pose serious challenges. Triangulating across multiple metrics and applying advanced statistical techniques can help in these scenarios.
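As one illustration, the sketch below applies a common mitigation for skew and outliers: capping (winsorizing) a heavy-tailed metric before running a Welch's t-test. The synthetic data and the 99.9th-percentile cap are assumptions for illustration, not a recommended recipe for any particular metric.

# Cap a heavy-tailed metric before testing (synthetic data; cap level is illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=1.00, sigma=1.5, size=100_000)    # e.g., revenue per user
treatment = rng.lognormal(mean=1.02, sigma=1.5, size=100_000)

cap = np.percentile(np.concatenate([control, treatment]), 99.9)
control_c = np.minimum(control, cap)
treatment_c = np.minimum(treatment, cap)

# Welch's t-test uses each group's own variance estimate.
t_stat, p_value = stats.ttest_ind(treatment_c, control_c, equal_var=False)
print(f"capped mean lift: {treatment_c.mean() - control_c.mean():.4f}, p-value: {p_value:.4f}")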

Sophia Liu is a Data Scientist on the Analysis and Experimentation (A&E) team at Microsoft. The A&E team runs thousands of experiments per month for products across Microsoft, such as Bing, Microsoft News, Skype, Windows, Xbox, and Office. Dr. Liu received her M.S. and Ph.D. degrees in Electrical Engineering from Columbia University and Northwestern University in 2012 and 2016, respectively. During her graduate studies, she won two best paper awards across 14 international publications and completed internships at Bell Labs, Cisco, and Alliance Data Systems. At Microsoft, she works with the Windows and Edge browser teams to improve products and the user experience through experimentation. Dr. Liu has given 50+ public talks at international conferences, Women in Tech events such as GHC X1 Seattle 2018 and the SWE Annual Conference 2014, and other outreach programs, and she specializes in audience engagement.
