
The Statistical Modeling System Powering LinkedIn Salary


Introduction

For most job seekers, salary (or, more broadly, compensation) is a crucial consideration in choosing a new job opportunity. Indeed, according to a survey of over 5,000 job seekers in the United States and Canada, more candidates (74%) want to see salary information than any other feature in a job posting. At the same time, job seekers face challenges in learning the salaries associated with different jobs, given the dearth of reliable sources containing compensation data. The LinkedIn Salary product was designed with the goal of providing compensation insights to the world's professionals, thereby helping them make more informed career decisions.

With its structured information, including the work experience, educational history, and skills associated with over 500 million members, LinkedIn is in a unique position to collect compensation data from its members at scale and provide rich, robust insights covering different aspects of compensation while still preserving member privacy. For instance, we can provide insight into the distribution of base salary, bonus, equity, and other types of compensation for a given profession, how these factors vary based on things like location, experience, company size, and industry, and which locations, industries, or companies pay the most.

In addition to helping job seekers understand their economic value in the marketplace, the compensation data has the potential to help us better understand the monetary dimensions of the Economic Graph, which includes companies, industries, regions, jobs, skills, and educational institutions, among other things.

The availability of compensation insights along different demographic dimensions can lead to greater transparency, shedding light on the extent of compensation disparity and thereby helping stakeholders, including employers, employees, and policy makers, take steps to address pay inequality.

Further, products such as LinkedIn Salary can improve efficiency in the labor marketplace by reducing the asymmetry of compensation knowledge and by serving as market-perfecting tools for workers and employers. Such tools have the potential to help students make good career choices by taking expected compensation into account, and to encourage workers to learn skills needed for obtaining well-paying jobs, thereby helping reduce the skills gap.

In this post, we will describe the overall design and architecture of the statistical modeling system underlying the LinkedIn Salary product. We will also focus on unique challenges we have faced, such as the simultaneous need for user privacy, product coverage, and robust, reliable compensation insights, and will describe how we addressed these challenges using mechanisms such as outlier detection and Bayesian hierarchical smoothing.

Problem setting

In the publicly-launched LinkedIn Salary product, members can explore compensation insights by searching for different titles and locations. For a given title and location, we present the quantiles (10th and 90th percentiles, and median) and histograms for base salary, bonus, and other types of compensation. We also present more granular insights on how the pay varies based on factors such as location, experience, education, company size, and industry, and on which locations, industries, or companies pay the most.

The compensation insights shown in the product are based on compensation data that we have been collecting from LinkedIn members. We designed a give-to-get model based on the following data collection process. First, cohorts (such as User Experience Designers in the San Francisco Bay Area) with a sufficient number of LinkedIn members are selected. Within each cohort, emails are sent to a random subset of members, requesting them to submit their compensation data (in return for aggregated compensation insights later). Once we collect sufficient data, we get back to the responding members with the compensation insights and reach out to the remaining members in those cohorts, promising corresponding insights immediately upon submission of their compensation data.

Considering the sensitive nature of compensation data and the desire to preserve the privacy of LinkedIn’s members, we designed our system such that there is protection against data breach, and against inference of any particular individual’s compensation data by observing the outputs of the system. Our methodology for achieving this goal, through a combination of techniques such as encryption, access control, de-identification, aggregation, and thresholding, is described in our IEEE PAC 2017 paper. Next, we will highlight the key data mining and machine learning challenges for the salary modeling system (see our ACM CIKM 2017 paper for more details).

Modeling challenges
Due to privacy requirements, the salary modeling system has access only to cohort-level data containing de-identified compensation submissions (e.g., salaries for UX Designers in the San Francisco Bay Area) and is limited to those cohorts having at least a minimum number of entries. Each cohort is defined by a combination of attributes, such as title, country, region, company, and years of experience, and contains de-identified compensation entries obtained from individuals who all share these same attributes. Within a cohort, each individual entry consists of values for different compensation types, such as base salary, annual bonus, sign-on bonus, commission, annual monetary value of vested stocks, and tips, and is available without an associated user name, ID, or any attributes other than those that define the cohort. Consequently, our modeling choices are limited, since we have access only to the de-identified data and therefore cannot, for instance, build prediction models that make use of more discriminating features not available due to de-identification.

Evaluation: In contrast to other member-facing products, such as job recommendations, we face unique evaluation and data quality challenges with our salary product. Since members themselves may not have a good perception of the true compensation range, they may not be in a position to evaluate whether the compensation insights displayed are accurate. Consequently, it is not feasible to perform online A/B testing to compare the compensation insights generated by different models. Further, there are very few reliable and easily available ground truth datasets in the compensation domain, and even when available (e.g., the BLS OES dataset), mapping such datasets onto LinkedIn's taxonomy is inevitably noisy.

Outlier detection: As the quality of the insights depends on the quality of the submitted data, detecting and pruning potential outlier entries is crucial. Such entries could arise due to either mistakes or misunderstandings during submission, or due to intentional falsification (such as someone attempting to game the system). We needed a solution to this problem that would work even during the early stages of data collection, when outlier detection is more challenging, and there may not be sufficient data across related cohorts to compare.

Robustness and stability: While some cohorts may have a large sample size, many cohorts typically contain very few (< 20) data points each. Given the desire to have data for as many cohorts as possible, we needed to ensure that the compensation insights were robust and stable even when data was sparse. That is, for such cohorts, the insights should be reliable, and not too sensitive to the addition of a new entry. We faced a similar challenge when it came to reliably inferring the insights for cohorts with no data at all.

Our problem can thus be stated as follows. How do we design the salary modeling system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we compute robust, reliable compensation insights based on de-identified compensation data (for preserving privacy of members), while addressing the product requirements, such as coverage?

LinkedIn Salary modeling system design and architecture

Our system consists of both an online component that uses a service-oriented architecture for retrieving compensation insights corresponding to the query from the user-facing product, and an offline component for processing de-identified compensation data and generating compensation insights.

[Figure: LinkedIn Salary modeling system architecture, showing the online serving components and the offline workflow that generates compensation insights]

LinkedIn Salary platform: The REST Server provides on-demand compensation insights on request by a REST Client. The REST API allows for retrieval of individual insights, or lists of them. For each cohort, an insight includes (when data is available) the quantiles (10th and 90th percentiles, and median), and histograms for base salary, bonus, and other compensation types. To ensure robustness of the insights in the face of small numbers of submissions and changes as data is collected, we report quantiles, such as the 10th and 90th percentiles and median, rather than absolute range and mean.

LinkedIn Salary use case: For an eligible member, compensation insights are obtained via a REST Client from the REST Server implementing the Salary Platform REST API. These are then presented as part of the LinkedIn Salary product. Based on the product and business needs, the eligibility can be defined in terms of criteria such as whether the member has submitted his or her compensation data within the last year (give-to-get model) or whether the member has a premium membership.

Our Salary Platform has four service APIs to give the information needed for LinkedIn Salary insight pages: (1) a “criteria” finder to obtain the core compensation insight for a cohort; (2) a “facets” finder to provide information on cohorts with insights for filters such as industry and years of experience; (3) a “relatedInsights” finder to obtain compensation insights for related titles, regions, etc.; and (4) a “topInsights” finder to obtain compensation insights by top-paying industries, companies, etc. These finders were carefully designed to be extensible as the product evolves over time. For instance, although we had originally designed the “criteria” finder to provide insights during the compensation collection stage, we were able to reuse and extend it for LinkedIn Salary and other use cases.

Offline system for computing compensation insights: The insights are generated using an offline workflow (discussed in depth here) that consumes the de-identified submissions data (corresponding to cohorts having at least a minimum number of entries) on HDFS and then pushes the results to the Insights and Lists Voldemort key-value stores for use by the REST Server. The offline workflow includes the modeling components, such as outlier detection and Bayesian hierarchical smoothing, which we describe next.

Statistical modeling for compensation insights

Outlier detection
An important goal for the LinkedIn Salary product is accuracy, which is difficult to evaluate, since there are few reliable public datasets with comparable information. We use member-submitted data as the main basis of reported results. Even if submissions were completely accurate, there would still be selection bias; but accuracy of submissions cannot simply be assumed. Mistakes and misunderstandings in submission entry occur, and falsification is possible.

As part of the salary modeling offline workflow, de-identified submissions are treated with three successive stages of outlier detection to reduce the impact of spurious data. The first stage applies sanity-check limits, such as the federal minimum wage as the lower bound and an ad hoc upper bound (e.g., $2M/year in the U.S.) for base salary. The second stage uses limits derived from U.S. Bureau of Labor Statistics (BLS) Occupational Employment Statistics (OES) data, aggregated from federal government mail surveys; this dataset contains estimates of employment and wage percentiles for different occupations and regions in the U.S. The third stage is based on the internal characteristics of each cohort of submissions, using a traditional box-and-whisker method applied to the data remaining from the second stage.

External dataset-based outlier detection: For outlier detection for U.S. base salary data, we mapped the BLS OES compensation dataset onto LinkedIn titles and regions as follows. There are about 840 BLS occupations with SOC (Standard Occupational Classification) codes; an example is 13-2051, Financial Analysts. To map them to the roughly 25,000 LinkedIn standardized (canonical) titles, first we expanded them to O*NET alternate titles (look for alternate titles data dictionary at www.onetcenter.org). Our example of “Financial Analyst” becomes 53 alternate titles, including Bond Analyst, Chartered Financial Analyst, Money Manager, Trust Officer, etc. To these, we applied LinkedIn standardization software, which maps an arbitrary title to a canonical title. Bond Analyst maps to Analyst, and the other three to themselves. In general, one BLS SOC code corresponds to more than one LinkedIn standardized title. The mapping from BLS to LinkedIn regions was done using ZIP codes. In general, more than one BLS region corresponds to one LinkedIn region. Thus, we had a many-to-many mapping. For each LinkedIn (title, region) combination, we obtained all BLS (SOC code, region) rows that map to it, computed associated box-and-whisker compensation range limits, and aggregated these limits to derive one lower and one upper bound. We obtained the limits for our internal LinkedIn standardized titles and region codes, resulting in more than a million LinkedIn (title, region) pairs. Submissions with base salary outside these limits are excluded.
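As a rough illustration of the aggregation step, the sketch below derives one lower and one upper bound per LinkedIn (title, region) pair from a many-to-many mapping of BLS rows. The data structures, the use of the published 25th/75th wage percentiles, and the min/max aggregation rule are illustrative assumptions, not the production logic.

```python
# Hypothetical sketch of aggregating BLS-derived limits over a many-to-many
# mapping onto LinkedIn (title, region) pairs.
from collections import defaultdict

def box_whisker_limits(p25, p75, low_factor=1.5, high_factor=1.5):
    """Box-and-whisker limits from the 25th/75th wage percentiles of one BLS row."""
    iqr = p75 - p25
    return p25 - low_factor * iqr, p75 + high_factor * iqr

def external_limits(bls_rows, soc_to_titles, bls_region_to_region):
    """bls_rows: iterable of (soc_code, bls_region, p25, p75) tuples."""
    limits = defaultdict(list)
    for soc, bls_region, p25, p75 in bls_rows:
        region = bls_region_to_region.get(bls_region)
        if region is None:
            continue
        for title in soc_to_titles.get(soc, []):
            limits[(title, region)].append(box_whisker_limits(p25, p75))
    # Aggregate the per-row limits into a single lower and upper bound per pair.
    return {key: (min(lo for lo, _ in rows), max(hi for _, hi in rows))
            for key, rows in limits.items()}
```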

Outlier detection from member-submitted data: Outlier detection based on member-submitted data itself is done by a box-and-whisker method, applied separately to each compensation type. The box-and-whisker method is as follows: for each compensation type for each cohort, we compute Q1 and Q3, the first and third quartiles respectively, and the interquartile range, IQR = Q3 - Q1. Then, we compute the lower limit as Q1 - 1.5 IQR, and the upper limit as Q3 + 2 IQR. We chose (and tuned) the different factors 1.5 and 2 to reflect the typically skewed compensation distributions observed in our dataset. Submissions with base salary outside the calculated range are excluded. Other compensation-type data are instead clipped to the limits. We do not prune the entire entry, since the base salary is valid, and given that, we do not want to remove outlying values of other compensation types, since that would have the effect of making them zero for the total compensation calculation. We also exclude whole compensation types or even whole cohorts when the fraction of outliers is too large. Note that this third stage is different in kind from the second stage, where the limits do not depend on the submissions received.
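The following is a minimal sketch of this per-cohort rule with the asymmetric factors 1.5 and 2. The entry format and the set of compensation types are illustrative, and the production handling of clipping and cohort-level exclusion is more involved.

```python
# A minimal sketch of the per-cohort box-and-whisker rule described above.
import numpy as np

def box_whisker_filter(entries, lower_factor=1.5, upper_factor=2.0):
    """entries: list of dicts with a 'base_salary' field and other compensation types."""
    base = np.array([e["base_salary"] for e in entries], dtype=float)
    q1, q3 = np.percentile(base, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - lower_factor * iqr, q3 + upper_factor * iqr
    # Entries whose base salary falls outside the limits are excluded entirely.
    kept = [e for e, b in zip(entries, base) if lo <= b <= hi]
    # Other compensation types are clipped to their own limits rather than dropped.
    for comp_type in ("bonus", "commission"):
        values = np.array([e.get(comp_type, 0.0) for e in kept], dtype=float)
        if values.size == 0:
            continue
        cq1, cq3 = np.percentile(values, [25, 75])
        ciqr = cq3 - cq1
        clo, chi = cq1 - lower_factor * ciqr, cq3 + upper_factor * ciqr
        for e, v in zip(kept, values):
            e[comp_type] = float(np.clip(v, clo, chi))
    return kept
```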

Studying the impact of spurious added data: Because of the lack of reliable ground truth data, our focus is on studying the impact of adding spurious data and then winnowing it by the box-and-whisker method. We take the de-identified compensation data collected from more than one million LinkedIn members, after it has gone through all the outlier detection stages, and consider it valid; this constitutes about 80% of the submissions in cohorts. We limit this data to title-country-region cohorts, to yearly U.S. base salary data, and to cohorts with at least 20 valid entries (the last restriction ensures that statistical smoothing does not interfere with interpretation). This leaves about 10,000 cohorts for the study.

For each cohort, we recorded selected quantiles (10th percentile, median, and 90th percentile), then perturbed the cohort in three ways by adding spurious, synthetic salary entries numbering certain fractions (5% through 35% in steps of 5%) of the original numbers of entries. The three methods of perturbation are:      

  • Add the spurious data at the federal minimum wage, $15,080.

  • Add the spurious data at $2M.  

  • Add the spurious data uniformly at random between the 10th and 90th percentiles, the low and high ends.

For each method, we examined the following measures, averaged over all the cohorts used in the test (a simulation sketch of this procedure follows the list).

  • The fraction of the added entries that are removed as outliers; a perfect score is 1.  

  • The fraction of the original entries that are removed as outliers; a perfect score is 0.  

  • The fractional changes in the selected quantiles. This is the most important of these measures in practice, because these are what we report to members.
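Below is a simplified, self-contained simulation of this perturbation study on a synthetic cohort; the cohort parameters and the box-and-whisker limits reuse the factors discussed earlier and are illustrative, not the ones used in the actual evaluation.

```python
# Simplified simulation of the robustness study: add spurious entries at the
# minimum wage, at $2M, or uniformly between the cohort's 10th and 90th
# percentiles, then measure what the box-and-whisker rule removes and how far
# the reported quantiles move.
import numpy as np

rng = np.random.default_rng(0)

def limits(salaries, lower_factor=1.5, upper_factor=2.0):
    q1, q3 = np.percentile(salaries, [25, 75])
    iqr = q3 - q1
    return q1 - lower_factor * iqr, q3 + upper_factor * iqr

def perturb_and_score(original, fraction, mode):
    p10, p50, p90 = np.percentile(original, [10, 50, 90])
    n_added = int(round(fraction * len(original)))
    if mode == "min_wage":
        added = np.full(n_added, 15_080.0)
    elif mode == "high":
        added = np.full(n_added, 2_000_000.0)
    else:  # uniform between the cohort's low and high ends
        added = rng.uniform(p10, p90, size=n_added)
    combined = np.concatenate([original, added])
    lo, hi = limits(combined)
    kept = combined[(combined >= lo) & (combined <= hi)]
    added_removed = float(np.mean((added < lo) | (added > hi))) if n_added else 0.0
    orig_removed = float(np.mean((original < lo) | (original > hi)))
    quantile_shift = np.abs(np.percentile(kept, [10, 50, 90]) - [p10, p50, p90]) / [p10, p50, p90]
    return added_removed, orig_removed, quantile_shift

# Example: a synthetic cohort of 50 log-normal salaries with 20% spurious $2M entries.
cohort = rng.lognormal(mean=11.5, sigma=0.3, size=50)
print(perturb_and_score(cohort, 0.20, "high"))
```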

[Table: impact of adding spurious data, showing the fraction of added and original entries removed as outliers and the changes in the reported quantiles, for each perturbation method and fraction]

For all the data added between the low and high ends, less than 1% of the added or original entries were removed, and the reported quantiles changed by less than 5%. For added high entries, nearly all were removed as outliers until the amount of added data became large: with 30% added, 92% were removed, and with 35%, none were; less than 1% of the original data were removed as outliers (see the table above for other results). Overall, the box-and-whisker outlier detection performed at least reasonably, and often quite well, except when very large amounts of spurious data were added or when more than 5% spurious minimum-wage entries shifted the low end.

Bayesian hierarchical smoothing
There is an inherent trade-off between the quality of compensation insights and product coverage: the higher the sample-size threshold for a cohort, the more accurate the empirical estimates; the lower the threshold, the larger the coverage. Since it is critical for the product to have good coverage across many different cohorts, obtaining accurate estimates of insights (e.g., percentiles) for cohorts with very few submissions turns out to be a key challenge. Due to the sparsity of the data, empirical estimates of the 10th or 90th percentile, or even the median, are not reliable when a cohort contains very little data. For example, the empirical estimate of the 10th percentile of a cohort's compensation with only 10 points is the minimum of the 10, which is known to be a very inaccurate estimate.
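To make the small-sample problem concrete, here is a quick illustration with synthetic data (the log-normal parameters below are arbitrary, not actual salary figures): with only 10 entries, the empirical 10th percentile is essentially determined by the smallest values and fluctuates widely from cohort to cohort.

```python
# Illustration of how unstable the empirical 10th percentile is for n = 10.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
mu, sigma = 11.5, 0.3                         # synthetic log-salary parameters
true_p10 = np.exp(mu + sigma * norm.ppf(0.10))

estimates = [np.percentile(rng.lognormal(mu, sigma, size=10), 10)
             for _ in range(1000)]
print(f"true P10: {true_p10:,.0f}")
print(f"empirical P10 over 1,000 cohorts of size 10: "
      f"mean {np.mean(estimates):,.0f}, std {np.std(estimates):,.0f}")
```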

We used a Bayesian hierarchical smoothing methodology to obtain accurate estimates of compensation insights for cohorts with very little data. Specifically, for cohorts with large enough sample sizes (i.e., number of samples greater than or equal to a threshold h, say 20), we considered it safe to use the empirical estimates for median and percentiles. On the other hand, for cohorts with sample sizes less than h, we first assumed that the compensation follows a log-normal distribution (see below for validation of the assumption). We then exploited the rich hierarchical structure amongst the cohorts and “borrowed strength” from the ancestral cohorts that have sufficient data to derive cohort estimates. For example, by successively relaxing the conditions, we can associate the cohort “UX designers in the San Francisco Bay Area in internet industry with 10+ years of experience” with larger cohorts such as “UX designers in the San Francisco Bay Area in internet industry,” “UX designers in the San Francisco Bay Area with 10+ years of experience,” “UX designers in the San Francisco Bay Area,” and so forth (illustrated below). We can then pick the “best” ancestral cohort using statistical methods. After the best ancestor is selected, we treat the data collected from the ancestral cohort as the prior to apply a Bayesian smoothing methodology and obtain the posterior of the parameters of the log-normal distribution for the cohort of interest. We also describe how to handle the “root” cohorts (a combination of job title and region in our product) using a prior from a regression model.

[Figure: example cohort hierarchy, from “UX designers in the San Francisco Bay Area in internet industry with 10+ years of experience” up through its broader ancestral cohorts]

Finding the best ancestral cohort: The set of all possible cohorts forms a natural hierarchy, where each cohort in the hierarchy can have multiple ancestors, as in the example of “UX designers in the San Francisco Bay Area in internet industry with 10+ years of experience” given above. We found the “best” ancestral cohort among all the ancestors of the cohort of interest in the hierarchy, where the ancestor that can “best explain” the observed entries in the cohort of interest statistically is considered the best. Formally, we assumed that each such entry is independently drawn at random from the distribution of the ancestral cohort, and selected the ancestral cohort that maximized the likelihood of the observed entries (discussed in depth here).
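A minimal sketch of this selection step might look as follows, assuming log-normally distributed salaries within each ancestor; the function names and the simple maximum-likelihood normal fit on log-salaries are illustrative.

```python
# Pick the ancestral cohort whose fitted log-normal distribution maximizes the
# likelihood of the cohort's own entries.
import numpy as np
from scipy.stats import norm

def log_likelihood(cohort_salaries, ancestor_salaries):
    """Log-likelihood of the cohort's entries under the ancestor's log-normal fit."""
    log_cohort = np.log(cohort_salaries)
    log_ancestor = np.log(ancestor_salaries)
    mu, sigma = np.mean(log_ancestor), np.std(log_ancestor)
    return norm.logpdf(log_cohort, loc=mu, scale=sigma).sum()

def best_ancestor(cohort_salaries, ancestors):
    """ancestors: dict mapping ancestor name -> array of its salary entries."""
    return max(ancestors, key=lambda name: log_likelihood(cohort_salaries, ancestors[name]))
```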

Bayesian hierarchical smoothing: Given the best ancestor, we viewed the distribution of the ancestral cohort as the prior and obtained the posterior log-normal distribution of the cohort of interest, based on the observed entries. In agreement with our intuition, greater weight is given to the cohort's own observed entries as their number increases. Formally, we applied a logarithmic transformation to all data entries, and then used a Gaussian-gamma distribution as the conjugate prior for the mean and precision of the cohort of interest. The posterior mean and variance are then obtained as a function of the mean and variance values for both the ancestral cohort and the cohort of interest, the corresponding sample sizes, and certain hyper-parameters that can be tuned (discussed in depth here). Finally, we obtained the median and the 10th and 90th percentiles based on the posterior distribution.
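The sketch below shows one way to carry out such a conjugate Gaussian-gamma (Normal-gamma) update on log-salaries, with the ancestral cohort supplying the prior mean and variance. The specific hyper-parameters (kappa0, alpha0) and the plug-in use of posterior point estimates are assumptions for illustration; the production system's choices may differ.

```python
# Conjugate Normal-gamma update on log-salaries, with the ancestral cohort as prior.
import numpy as np
from scipy.stats import norm

def smoothed_lognormal_quantiles(cohort, ancestor, kappa0=10.0, alpha0=5.0,
                                 quantiles=(0.10, 0.50, 0.90)):
    x = np.log(np.asarray(cohort, dtype=float))      # cohort of interest
    a = np.log(np.asarray(ancestor, dtype=float))    # best ancestral cohort
    mu0, var0 = a.mean(), a.var()
    beta0 = alpha0 * var0                            # prior mean precision is 1 / var0

    n, xbar = len(x), x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n       # posterior mean of log-salary
    alpha_n = alpha0 + n / 2.0
    ss = ((x - xbar) ** 2).sum()
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    var_n = beta_n / alpha_n                         # posterior point estimate of variance

    # Plug the posterior estimates into a log-normal to read off the quantiles.
    return [float(np.exp(mu_n + norm.ppf(q) * np.sqrt(var_n))) for q in quantiles]
```

As the cohort's own sample size n grows, mu_n and var_n are dominated by the cohort's data rather than the prior, matching the intuition described above.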

Smoothing for root cohorts: There can be cases where the root cohorts in the hierarchy do not have enough samples to estimate percentiles empirically. For the LinkedIn Salary product specifically, we considered a root cohort as a function of a job title and a geographical region, and modeled the compensation for these cohorts using a number of features. Simply using parent cohorts, such as title only or region only, as the prior might not be good enough, since compensation for the same title varies considerably across regions (e.g., New York versus Fresno), and compensation within the same region varies considerably across titles (e.g., software engineers versus nurses). Hence, for these cohorts, we built a feature-based regression model to serve as the prior for Bayesian smoothing (discussed in depth here).
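As a purely hypothetical illustration of a feature-based prior, one could regress log median salary on simple title and region features and use the prediction as the prior mean for smoothing; the model below (one-hot features with ridge regression) is far simpler than the production model.

```python
# Hypothetical feature-based prior for root (title, region) cohorts.
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

def fit_root_prior(cohorts):
    """cohorts: list of dicts like {'title': 'UX Designer', 'region': 'SF Bay Area',
    'log_median_salary': 11.6}, restricted to root cohorts with enough data."""
    X = [{"title": c["title"], "region": c["region"]} for c in cohorts]
    y = np.array([c["log_median_salary"] for c in cohorts])
    model = make_pipeline(DictVectorizer(sparse=True), Ridge(alpha=1.0))
    model.fit(X, y)
    return model

# Usage (hypothetical):
# prior_log_mean = fit_root_prior(training_cohorts).predict(
#     [{"title": "UX Designer", "region": "Fresno, CA"}])[0]
```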

Inference model for root cohorts: It is desirable to have insights for as many cohorts as possible, and in particular, for cohorts with no data. As a first step, we have extended the feature-based regression model discussed above to also predict insights for the root (title, region) cohorts. We modeled this problem as a robust Bayesian matrix factorization problem, where title is like item, region is like user, and salary for the (title, region) cohort is like rating. A key challenge is to determine when to perform this inference, since the inference may not be reliable when there is not sufficient data for the corresponding job title or for the corresponding geographical region. We addressed this problem by using statistical techniques to determine the minimum sample size support for each title and for each region, needed to perform reliable inference. We have recently further extended this approach to predict insights for (title, company, region) cohorts with no data.
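The sketch below conveys the matrix-completion idea with plain alternating least squares on a (title, region) matrix of log salaries; the robust Bayesian formulation referenced above additionally places priors on the factors and quantifies uncertainty, which this simplification omits.

```python
# Simplified matrix completion: factor a (title x region) matrix of observed
# log median salaries and fill in missing cells from the learned factors.
import numpy as np

def als_complete(M, mask, rank=5, reg=0.1, iters=50, seed=0):
    """M: titles x regions matrix of log salaries; mask: 1 where observed, 0 otherwise."""
    rng = np.random.default_rng(seed)
    n_titles, n_regions = M.shape
    T = rng.normal(scale=0.1, size=(n_titles, rank))   # title ("item") factors
    R = rng.normal(scale=0.1, size=(n_regions, rank))  # region ("user") factors
    eye = reg * np.eye(rank)
    for _ in range(iters):
        for i in range(n_titles):
            obs = mask[i] > 0
            if obs.any():
                T[i] = np.linalg.solve(R[obs].T @ R[obs] + eye, R[obs].T @ M[i, obs])
        for j in range(n_regions):
            obs = mask[:, j] > 0
            if obs.any():
                R[j] = np.linalg.solve(T[obs].T @ T[obs] + eye, T[obs].T @ M[obs, j])
    return T @ R.T   # predicted log salary for every (title, region) pair
```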

Peer-company groups: We have also investigated the creation of intermediary levels, such as company clusters, between companies and industries. Such peer-company groups are beneficial for improving the hierarchical smoothing (since they could be used as ancestral cohorts for smoothing company-level insights) and also for increasing coverage of insights (since we could present insights for the peer-company group when there is insufficient data for a company). The intuition underlying our computation of peer-company groups is that two companies are similar if many LinkedIn members have moved jobs from the first company to the second company, and vice versa. Formally, we define the similarity score between company u and company v in terms of the combined likelihood of a member transitioning from company u to company v, and vice versa.
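One plausible instantiation of this score (an illustrative reading, not necessarily the exact production formula) is sketched below: estimate the transition probability in each direction from member job-transition counts and combine the two directions.

```python
# Symmetric company-similarity scores from member job transitions.
from collections import defaultdict

def company_similarity(transitions):
    """transitions: dict mapping (from_company, to_company) -> number of members
    observed making that move. Returns one score per unordered company pair."""
    outgoing = defaultdict(int)
    for (u, _), count in transitions.items():
        outgoing[u] += count
    scores = {}
    for (u, v), count in transitions.items():
        if (v, u) in transitions and (v, u) not in scores:
            p_uv = count / outgoing[u]                 # P(next company is v | leaving u)
            p_vu = transitions[(v, u)] / outgoing[v]   # P(next company is u | leaving v)
            scores[(u, v)] = p_uv * p_vu               # combined two-way likelihood
    return scores
```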

Validation of log-normal assumption: We sampled 10,000 entries at random from the set of about 800,000 base salaries collected in the U.S. and generated Q-Q plots in both original and logarithmic scales (shown below). In both plots, the X-axis corresponds to normal theoretical quantiles, and the Y-axis corresponds to the data quantiles for U.S. base salaries in original and logarithmic scales, respectively. We can visually observe that the logarithmic scale is a better fit (closer to a straight line), suggesting that the data on the whole can be fit to a log-normal distribution. Hence, for training the regression model as well as applying statistical smoothing, we used the compensation data in the logarithmic scale. However, this assumption is an approximation, since the observed salaries for an individual cohort may not follow a log-normal distribution.
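For readers who want to reproduce this kind of check, a minimal sketch follows; the synthetic log-normal sample stands in for the actual member-submitted salaries, which are not public.

```python
# Q-Q plots of a salary sample against the normal distribution,
# in original and logarithmic scales.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=11.5, sigma=0.35, size=10_000)  # stand-in sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(salaries, dist="norm", plot=ax1)
ax1.set_title("Original scale")
stats.probplot(np.log(salaries), dist="norm", plot=ax2)
ax2.set_title("Logarithmic scale")
plt.tight_layout()
plt.show()
```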

[Figure: Q-Q plots of U.S. base salaries against normal theoretical quantiles, in original (left) and logarithmic (right) scales]

Evaluating statistical smoothing: We evaluated statistical smoothing using a “goodness-of-fit” analysis and a quantile coverage test (discussed in depth here). With the former analysis, we observed that combining ancestor and cohort data with statistical smoothing results in better goodness-of-fit for the observed data in the hold-out set, compared to using the ancestor or the cohort alone. An intuitive explanation is that statistical smoothing provides better stability and robustness of insights: inferring a distribution from just the observed entries in a cohort's training set and using it to fit the corresponding hold-out set is not as robust as using the smoothed distribution, especially when the cohort contains very few entries.

In the latter test, we measured what fraction of a hold-out set lies between two quantiles (say, the 10th and 90th percentiles), computed based on the training set (1) empirically without smoothing, and (2) after applying smoothing. Ideally, this fraction should equal 80%. We observed that the fractions computed using smoothed percentiles were significantly better than those computed using empirical percentiles: the fraction was 85% with smoothing (close to the ideal of 80%), but only 54% with the empirical approach. We observed similar results for various segments, such as cohorts containing a company, an industry, or a degree.

These two tests together establish that statistical smoothing leads to significantly better and more robust compensation insights. By employing statistical smoothing and prediction models, we were able to reduce the threshold used for displaying compensation insights in the LinkedIn Salary product and further infer insights for cohorts with no data at all, thereby achieving a significant increase in product coverage while preserving the quality of the insights.
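A bare-bones version of the quantile coverage computation might look as follows; `quantile_fn` stands in for whichever percentile estimator (empirical or smoothed) is being evaluated, and the averaging over cohorts is left to the caller.

```python
# Quantile coverage test: what fraction of a hold-out split falls between the
# 10th and 90th percentiles estimated from the training split? Ideally 0.80.
import numpy as np

def coverage(train, holdout, quantile_fn):
    """quantile_fn(train) -> (p10, p90); returns the fraction of hold-out inside."""
    p10, p90 = quantile_fn(train)
    holdout = np.asarray(holdout, dtype=float)
    return float(np.mean((holdout >= p10) & (holdout <= p90)))

def empirical_quantiles(train):
    return tuple(np.percentile(train, [10, 90]))

# Compare coverage(train, holdout, empirical_quantiles) against a smoothed
# estimator such as the conjugate update sketched earlier; values closer to
# 0.80 indicate better-calibrated insights.
```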

What’s next

We are excited about the immense potential for LinkedIn Salary to help job seekers and the related research possibilities to better understand and improve the efficiency of the career marketplace. We are also currently pursuing several directions to extend this work. Broadly, we would like to improve the quality of insights and product coverage via better data collection and processing, including improvement of the statistical smoothing methodology, better estimation of variance in the regression models, and improved outlier detection. Another direction is to use other datasets (e.g., position transition graphs, salaries extracted from job postings, etc.) to detect and correct inconsistencies in the insights across cohorts. Finally, mechanisms can be explored to quantify and address different types of biases, such as sample selection bias and response bias. For example, models could be developed to predict response propensity based on member profile and behavioral attributes, which could then be used to compensate for response bias through techniques such as inverse probability weighting.

Acknowledgments

This blog post is based on joint work with Stuart Ambler, Xi Chen, Yiqun Liu, Liang Zhang, and Deepak Agarwal. Please refer to the ACM CIKM 2017 paper for more details. This paper received the Best Case Studies Paper Award at CIKM 2017.


We would like to thank all the other members of the LinkedIn Salary team for their collaboration in deploying our system as part of the launched product, and Keren Baruch, Stephanie Chou, Ahsan Chudhary, Tim Converse, Tushar Dalvi, Anthony Duerr, David Freeman, Joseph Florencio, Souvik Ghosh, David Hardtke, Parul Jain, Prateek Janardhan, Santosh Kumar Kancha, Ryan Sandler, Ram Swaminathan, Cory Scott, Ganesh Venkataraman, Ya Xu, and Lu Zheng for insightful feedback and discussions.