Illustrations by Tiffany Pai

Measuring What Makes Readers Subscribe to The New York Times

How we build fast and reusable econometric models in-house

Daniel Mill
Published in NYT Open
Nov 15, 2018 · 6 min read


Understanding what drives someone to purchase a news subscription is far from simple. Each potential subscriber is exposed to different news stories, advertisements and messages, both on our site and off it. Separating these influences is an overwhelming task, but understanding the power of news and marketing drivers is necessary to build an effective subscription business. If we are to spend money on marketing and media efficiently, we need to quantify how each stimulus, both on-site and off-site, contributes to subscriptions.

An entire industry exists to tackle this very problem, with solutions ranging from Market Mix Models (MMM) to user-tracking attribution models to surveys. While all of these methodologies offer a way to separate signal from noise, each comes at a great cost in both time and money. When working with a vendor, a considerable amount of time and resources can be spent passing data, validating it, building a model and finally having the work presented back. Because of all the steps required, a typical model can take weeks to come to fruition. But what if we need to react to market responses more immediately? To solve this problem, we decided to build a Market Mix Model infrastructure in-house.

Statistically speaking, an MMM is a process for decoupling competing subscriber-conversion drivers; it is a multivariate, time-series regression model. For example, if we wanted to determine the effectiveness of a TV campaign on subscriptions, we would measure TV’s historical correlation with subscriptions while holding constant the effects of sales, seasonality and other marketing vehicles. Simply put, to understand how any one channel drives subscriptions we need to model and quantify all drivers holistically.
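In equation form, a simplified version of that regression might look something like this (the channel names here are illustrative, not our actual inputs):

    subscriptions_t = β0 + β1·TV_t + β2·Search_t + β3·Display_t + β4·Sale_t + β5·Seasonality_t + ε_t

Each β is the estimated incremental effect of that driver in period t, and the error term ε_t captures whatever the inputs cannot explain.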

An MMM is only as strong as its inputs, and leaving any significant variable out of the model can inflate the results attributed to the variables that remain. Building out a rapid-response MMM therefore requires having all potentially useful data on hand and in an easily accessible environment.

Data Management

Building the MMM in-house starts with an immediate advantage: most of the data necessary to build one is already being reported internally somewhere. The most important dataset, our internal event tracking of on-site behaviors (e.g. NYT pageviews), already exists in our Google Cloud Platform (GCP) environment. While this dataset can help us understand which stories may be driving subscriptions, understanding what off-site influences lead up to a visitor reaching our site is just as critical. Almost all of this data comes from third-party partners such as Facebook and Google, but it is generally used by analysts and marketers in a siloed UI. We found that data accessibility was an obstacle, so our first step was to build API integrations for each external data source that funnel the data into one cloud storage location. These data sources include, but are not limited to:

  • Search Clicks and Impressions
  • Offsite and Onsite Displays
  • Facebook Owned and Paid Impressions
  • Twitter Impressions, Engagement, Retweets, Replies, Likes
  • App Downloads
  • Sale Dates
  • Economic Variables

Having all this data pulled into a central GCP warehouse on a consistent basis gives us the data we need, but in different formats that cannot be easily pulled into a statistical model. For that, we rely on Python to not only clean and format the data, but also to run complex regressions from which we can glean insights.
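As a rough sketch of what one of these pipelines looks like (the API client, table name and project ID below are placeholders, not our production code), each connector pulls a day of data from a partner API, normalizes it into a DataFrame and appends it to a central table in BigQuery:

    import pandas as pd

    def load_daily_metrics(api_client, report_date):
        # Pull one day of raw metrics from a partner API (hypothetical client).
        raw = api_client.get_metrics(date=report_date)

        # Normalize the vendor-specific payload into a flat DataFrame.
        df = pd.DataFrame(raw)
        df["report_date"] = pd.to_datetime(report_date)

        # Append the day's data to a central table in our BigQuery warehouse
        # (placeholder table and project names).
        df.to_gbq("marketing.daily_channel_metrics",
                  project_id="our-gcp-project",
                  if_exists="append")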

When preparing to build a model, data vendors typically ask what data we need, and we almost always need all of it. The second our model finds an input variable significant (or insignificant), we need to break it down further and parse out which underlying factor is driving the result, and in order to do that we require all of the surrounding metadata. Since we store and process data at a high level of granularity, there is always the potential for processing data incorrectly or missing it altogether, which is why it is necessary to validate it.

Data Validation

Any econometric model is only as strong as the inputs gathered: “garbage in, garbage out,” as the adage goes. The common approach in the industry is to pull in all sources of data, manipulate and visualize them and finally present the results back to stakeholders and experts to ensure the quality of the inputs. Since sending this data to external agencies is time-consuming for all parties involved, a lot of friction can be removed by keeping the process internal. To do this, we introduced a cloud-based reporting infrastructure connected directly to our BigQuery warehouse. This allows the inputs we use for modeling to be transformed and pipelined directly to our analysts, who can monitor the data for accuracy as soon as it is ingested from the APIs. Data validation thus becomes a daily process, which lets our data analysts focus their effort on modeling.
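A minimal sketch of one such daily check, under assumed table, column and threshold choices that stand in for our real ones: query the trailing 30 days of row counts per source from BigQuery and flag any source whose latest day is missing or far below its average.

    import pandas as pd

    # Daily row counts per source for the trailing 30 days
    # (placeholder table and project names).
    query = """
        SELECT source, DATE(event_ts) AS day, COUNT(*) AS row_count
        FROM `our-gcp-project.marketing.daily_channel_metrics`
        WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
        GROUP BY source, day
    """
    counts = pd.read_gbq(query, project_id="our-gcp-project", dialect="standard")

    # Flag any source whose latest day is empty or well below its trailing average.
    latest = counts["day"].max()
    baseline = counts.groupby("source")["row_count"].mean()
    today = counts[counts["day"] == latest].set_index("source")["row_count"]
    suspicious = baseline.index[today.reindex(baseline.index).fillna(0) < 0.5 * baseline]
    print("Sources to review:", list(suspicious))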

Illustration by Tiffany Pai

Data Modeling

Often in data science or econometrics, the majority of the time spent building a model goes into preparation before any actual modeling has taken place. We have found success in reducing data-processing time by using the Pandas Python library. Pandas has a pre-written wrapper for BigQuery that allows us to ingest data from GCP directly into a DataFrame. (A Pandas DataFrame is a powerful, open source data structure with an Excel-like layout that lets us manipulate data efficiently.) We can migrate data to or from BigQuery in as little as three lines of code.
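For example, round-tripping a table between BigQuery and a DataFrame can look roughly like this (the query, table and project names are placeholders):

    import pandas as pd

    df = pd.read_gbq("SELECT * FROM `marketing.model_inputs`",
                     project_id="our-gcp-project", dialect="standard")
    df = df.dropna()  # ...any Pandas transformations happen here...
    df.to_gbq("marketing.model_inputs_clean",
              project_id="our-gcp-project", if_exists="replace")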

With simple access to BigQuery, we use Python to loop through multiple queries and ingest all the necessary data from separate tables into one location. With the data in one place, we can transform it as needed with Pandas. A number of manipulations are needed to run a Market Mix Model, such as:

  • Bucketing all data into consistent time ranges
  • Creating seasonal indices
  • Creating adstock transformations to measure latent effects
  • Smoothing data

Most of these transformations already come with prewritten functions in the Pandas library to make the manipulations as painless as possible (pivot_table for quickly bucketing data, rolling for smoothing data). For any custom transformations, such as adstocking, we can write our own logic to automate the process. While all of these are technically possible in SQL, Python and Pandas simplify the process with a fraction of the code. When thinking about automation and fast model turnarounds, we want as little code to debug as humanly possible, and the simple connectivity between Pandas and BigQuery allows us to choose whichever tool can manipulate the data most efficiently.
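As an illustrative sketch of these transformations (the schema and decay rate below are hypothetical, not our production choices), weekly bucketing, rolling-mean smoothing and a simple geometric adstock might look like this:

    import pandas as pd

    # Hypothetical daily data: one row per channel per day (not our real schema).
    df = pd.DataFrame({
        "day": pd.date_range("2018-01-01", periods=28).repeat(2),
        "source": ["tv", "search"] * 28,
        "impressions": range(56),
    })

    def adstock(series, decay=0.5):
        # Geometric adstock: this period's effective pressure is this period's
        # activity plus a decayed carryover from the previous period.
        carryover = 0.0
        out = []
        for value in series:
            carryover = value + decay * carryover
            out.append(carryover)
        return pd.Series(out, index=series.index)

    # Bucket the daily data into weekly totals, one column per channel.
    weekly = df.pivot_table(index=pd.Grouper(key="day", freq="W"),
                            columns="source", values="impressions", aggfunc="sum")

    # Smooth a noisy series with a rolling mean and apply adstock to TV.
    weekly["search_smoothed"] = weekly["search"].rolling(window=4, min_periods=1).mean()
    weekly["tv_adstocked"] = adstock(weekly["tv"], decay=0.6)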

Following our data transformation process we’re ready to build our model. There are multiple Python libraries well suited to statistical modeling, but for an MMM we prefer Statsmodels because it is geared towards the traditional econometric interpretation of our input variables.

Statsmodels takes the DataFrames created by Pandas and runs our regressions; we then programmatically decompose the results and store them back in BigQuery. With the model results stored in BigQuery, we can use the same reporting tools (Chartio) that we used to validate the data to distribute the newly modeled results back to end users. Ingestion, cleaning, modeling and redistribution of data all happen in one rapid and simple process.
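A simplified sketch of that last step, assuming a weekly DataFrame that holds subscriptions plus transformed drivers (the column, table and project names are illustrative, and the real model includes many more controls):

    import statsmodels.api as sm

    # `weekly` is assumed to contain a subscriptions column alongside drivers.
    X = sm.add_constant(weekly[["tv_adstocked", "search_smoothed"]])
    y = weekly["subscriptions"]

    model = sm.OLS(y, X, missing="drop").fit()
    print(model.summary())

    # Decompose the fit: each driver's weekly contribution is its
    # coefficient multiplied by its value that week.
    contributions = X.mul(model.params, axis=1)
    contributions["subscriptions_actual"] = y

    # Store the decomposition back in BigQuery for reporting
    # (placeholder table and project names).
    contributions.reset_index().to_gbq("mmm.weekly_contributions",
                                       project_id="our-gcp-project",
                                       if_exists="replace")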

Reducing friction in a process rife with data transfers is essential to getting answers quickly. Removing our reliance on vendors for modeling eliminates the need to send large amounts of data to external agencies. Moving data reporting and ownership from individual UIs to a central repository democratizes the necessary model inputs. And lastly, a single programmatic workstream that automates ETL (Extract, Transform, Load), data transformations and modeling saves data analysts time in producing the modeled insights we need.

Owning all of this information also gives us a full view of every step of data processing, right up to the modeled results. Having the models in-house gives us total insight into potential data gaps and the assumptions going into the model, as well as full transparency into the modeled outputs, so we can validate them and maintain statistical integrity.

If this type of work sounds cool to you, come work with us. We’re hiring! Apply here.
