Hot property: How Zillow became the real estate data hub

The R language, open source analytics software, and a migration to AWS are helping Zillow cement its position as the leading real estate data provider

If you’re buying a home or looking for an apartment, Zillow.com most likely comes to mind first -- a branding triumph for a website that launched only 10 years ago. Today the Zillow Group is a public company with $645 million in revenue that also operates websites for mortgage and real estate professionals -- and it completed the acquisition of its nearest competitor, Trulia, last year.

From the start, Zillow offered the “Zestimate,” its automated home-valuation feature, for homes across the United States. Currently, Zillow claims to have Zestimates for more than 100 million homes, with 100-plus attributes tracked for each property. The technology powering Zestimates and other features has advanced steadily over the years, with open source and cloud computing playing increasingly important roles.

Last week I interviewed Stan Humphries, chief analytics officer at Zillow, along with Jasjeet Thind, senior director of data science and engineering. With diverse data sources, a research group staffed by a dozen economists, and predictive modeling enhanced by a large helping of machine learning, Zillow has made major investments in big data analytics as well as the talent to ensure visitors get what they want. Together, Humphries and Thind preside over a staff of between 80 and 90 data scientists and engineers.

An analytics platform grows up

Humphries says Zillow’s technology has evolved in three phases, the common thread being the R language, which the company’s data scientists have used for predictive modeling from the beginning. At first, R was used for prototyping to “figure out what we wanted to do and what the analytic solution looked like.” Data scientists would write up specifications that described the algorithm, which programmers would then implement in Java and C++ code.
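
Zillow hasn’t published that early code, but the kind of R prototype Humphries describes might have looked something like the hedonic regression below -- a purely hypothetical sketch, with invented data and column names, of the sort of model a data scientist would spec out for reimplementation in Java or C++.

    # Hypothetical phase-one prototype: a log-linear hedonic pricing model.
    # The data frame and its columns are invented for illustration.
    homes <- data.frame(
      price = c(250000, 310000, 189000, 420000, 275000, 335000),
      sqft  = c(1600, 2100, 1200, 2800, 1750, 2300),
      beds  = c(3, 4, 2, 4, 3, 4),
      baths = c(2, 2.5, 1, 3, 2, 2.5)
    )

    # Modeling log(price) makes each coefficient roughly the percentage
    # change in value per unit of the attribute.
    fit <- lm(log(price) ~ sqft + beds + baths, data = homes)
    summary(fit)

    # Value a new home, converting back from log dollars.
    new_home <- data.frame(sqft = 1850, beds = 3, baths = 2)
    exp(predict(fit, new_home))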

That system worked reasonably well, says Humphries, but with serious shortcomings:

You could bring in people from the data science side who were great with machine learning and methodologies -- and separate that really expansive, creative thinking on the solution side from the actual implementation side. That was the attractiveness of that model. The downside to that model is that it’s very slow ... You sometimes end up in a suboptimal situation when you’ve let a data scientist think up a solution before letting an engineer think about how it’s actually being implemented.

By the same token, he says, troubleshooting required an awkward round trip. A problem might arise in a production system running C++ or Java, which needed to be diagnosed by a data scientist who was accustomed to working in R.

The second phase of Zillow’s technology development was mainly about developing parallelization frameworks, so more of the production implementation could be done in R and less recoding in C++ and Java would be required. Using R in production required an investment in more powerful hardware, or “scaling vertically by getting bigger and bigger machines,” as Humphries puts it. Additionally, for certain batch jobs such as recomputing the value of homes over decades, Zillow turned to the Amazon cloud.
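
In base R, that kind of vertical scaling looks roughly like the sketch below, which uses the built-in parallel package to spread a batch job across every core of one big machine. The per-county job is a hypothetical stand-in.

    # Farm a batch revaluation job out across the cores of a single
    # large machine -- "scaling vertically," in Humphries' phrase.
    library(parallel)

    revalue_county <- function(county) {
      # Hypothetical stand-in for the real per-county batch job.
      paste("revalued homes in", county)
    }

    counties <- c("King", "Pierce", "Snohomish", "Kitsap")

    cl <- makeCluster(detectCores())   # one worker per core
    results <- parLapply(cl, counties, revalue_county)
    stopCluster(cl)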

That brings us to the third phase for Zillow: A migration to AWS (Amazon Web Services), with the long-term goal of moving the entire operation there. At the same time, the company is making the leap from proprietary to open source technologies.

Embracing the cloud and open source

Conventional wisdom says that, although the cloud may be ideal for short-term jobs, the ongoing cost of cloud services becomes too burdensome over the long haul. Not so in the case of Zillow, says Thind:

We did a fairly deep analysis on our cost at the kilobyte level, the amount of data we project we’re going to have. We’re storing a lot of our data in S3, which is relatively cheap, and of course using Glacier to make it even cheaper.
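
The arithmetic behind that decision is easy to reproduce. The sketch below compares monthly storage bills at assumed per-gigabyte prices -- roughly the circa-2016 published rates, which vary by region and have since changed, so treat them as placeholders.

    # Back-of-the-envelope storage costs. Prices are assumptions; check
    # the current AWS price list before trusting the output.
    s3_per_gb      <- 0.030   # S3 Standard, $/GB-month (assumed)
    glacier_per_gb <- 0.007   # Glacier, $/GB-month (assumed)

    archive_gb <- 200 * 1024  # say 200 TB of cold data

    monthly_savings <- archive_gb * (s3_per_gb - glacier_per_gb)
    sprintf("Archiving 200 TB to Glacier saves ~$%s per month",
            format(round(monthly_savings), big.mark = ","))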

Thind says Zillow is also opting to use the Amazon Kinesis streaming data platform, because he found the cost of using Kinesis “fairly reasonable,” considering it’s a managed service. The other candidate was Kafka, the open source messaging system, but the convenience of Kinesis tipped the balance.
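
Writing to Kinesis from R takes only a few lines with an AWS SDK. The sketch below uses the community paws package -- an assumption on my part, not Zillow’s disclosed tooling -- along with an invented stream name and event schema.

    # Push one clickstream event into a Kinesis stream. The paws SDK,
    # stream name, and event fields are all assumptions for illustration.
    library(paws)
    library(jsonlite)

    client <- kinesis()

    event <- toJSON(list(zpid = 48749425, action = "view",
                         ts = as.numeric(Sys.time())), auto_unbox = TRUE)

    client$put_record(
      StreamName   = "clickstream-events",  # hypothetical stream
      Data         = charToRaw(event),
      PartitionKey = "48749425"             # keeps one home's events ordered
    )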

In other cases, open source technologies deployed by Zillow in the cloud have won out. Zillow started with Microsoft SQL Server as its primary data store, but on AWS it’s in the process of moving to Redis, an open source key-value store. Thind says the company considered migrating to DynamoDB, Amazon’s native database service, instead, but found it “just not cost-effective.”
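
The key-value access pattern that makes Redis attractive is simple to demonstrate. The sketch below uses the redux client -- one of several R Redis clients, chosen here as an assumption -- with an invented key scheme.

    # Store and fetch a property's reconciled facts by key. The redux
    # client, key scheme, and fields are assumptions for illustration.
    library(redux)
    library(jsonlite)

    r <- hiredis()   # connects to localhost:6379 by default

    facts <- toJSON(list(beds = 3, baths = 2, sqft = 1850), auto_unbox = TRUE)
    r$SET("home:48749425", facts)

    # Constant-time lookup by key -- the pattern key-value stores favor.
    fromJSON(r$GET("home:48749425"))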

Zillow has also become a major Spark shop. According to Thind:

Zestimate is composed of lots of different types of machine learning models, and we can run a lot of these models in parallel. There are 3,000 counties and we want to be able to run these models in parallel across multiple nodes. With Spark we’re able to achieve that -- run a lot of things in parallel and produce Zestimates much faster and more frequently.
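
sparklyr, which exposes Spark to R, makes that per-county pattern concrete. The sketch below is hypothetical -- invented connection settings, table, and columns, not Zillow’s code -- and fits one ordinary R model per county, with each county running as its own Spark task.

    # Run plain R modeling code per county, in parallel across a cluster.
    # The "homes" data frame and its columns are invented.
    library(sparklyr)
    library(dplyr)

    homes <- data.frame(
      county = rep(c("King", "Pierce"), each = 3),
      price  = c(250000, 310000, 420000, 189000, 275000, 335000),
      sqft   = c(1600, 2100, 2800, 1200, 1750, 2300),
      beds   = c(3, 4, 4, 2, 3, 4),
      baths  = c(2, 2.5, 3, 1, 2, 2.5)
    )

    sc <- spark_connect(master = "yarn")      # "local" works for testing
    homes_sdf <- sdf_copy_to(sc, homes, "homes")

    fit_quality <- spark_apply(
      homes_sdf,
      function(df) {
        fit <- lm(log(price) ~ sqft, data = df)
        data.frame(rmse = sqrt(mean(residuals(fit)^2)))
      },
      group_by = "county"
    )

    collect(fit_quality)   # one row of fit diagnostics per county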

For machine learning, Zillow employs various decision tree, random forest, and regression algorithms, and it is currently prototyping with deep learning. Soon, says Thind, Zillow will use Spark as a deep learning platform as well. “Primarily, we’re leveraging a lot of the machine learning stuff out of R through Spark,” he says. The company also uses Spark Streaming to feed data to its models, with Amazon Kinesis as the underlying platform.
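
For the tree ensembles Thind mentions, Spark’s MLlib exposes a random forest regressor directly to R through sparklyr. Again a hedged sketch rather than anything Zillow has published, reusing the hypothetical homes_sdf table from above:

    # Train and evaluate an MLlib random forest from R via sparklyr.
    library(sparklyr)

    splits <- sdf_random_split(homes_sdf, train = 0.8, test = 0.2, seed = 42)

    rf <- ml_random_forest(
      splits$train,
      price ~ sqft + beds + baths,
      type      = "regression",
      num_trees = 200
    )

    pred <- ml_predict(rf, splits$test)
    ml_regression_evaluator(pred, label_col = "price", metric_name = "rmse")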

Data in, building out

One effect of Zillow’s leading position is that it has become a canonical data source. “It’s our focus that Zillow be the largest, most trusted, vibrant marketplace for real estate information. That mission is best served by making our data the most authoritative and the most ubiquitous out there,” says Humphries.

Clean real estate data is not easy to come by. According to Humphries, one of Zillow’s core innovations from the start was to ingest data from various standard sources and integrate it into “a reconciled representation of real estate facts.” This meant dealing with a host of different data formats -- plus, not all the data was digital, which meant much had to be keyed in. Altogether, he characterizes it as “very messy, noisy data.”

Humphries lays out some of the thorny issues in sorting fact from fiction:

When you look at a home, does it have three bedrooms or four bedrooms? Two sources say four; the user says three. How do you resolve those? That’s a lot of what we’ve done over the years, create the systems to do that. ... If you sell a house for $100,000, we’ll get it from an MLS system, but we’ll also get it from the public recorder from the deeds and reconcile those facts together. It’s a huge effort around bringing in a lot of disparate data and creating a single representation.
Internally we call that our living database of all homes. We view that as a core competitive asset, in that right now we have the single best representation anywhere of real estate facts, because ours is a superset of every other source that’s out there and it’s a superset that’s been resolved and reconciled.
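
A toy version of that reconciliation logic fits in a dozen lines of R: take a majority vote across sources, breaking ties in favor of the most trusted source. The data, source names, and trust ranking below are all invented.

    # Resolve conflicting per-home facts by vote, then by source trust.
    library(dplyr)

    facts <- data.frame(
      zpid   = c(48749425, 48749425, 48749425),
      source = c("county", "mls", "user"),
      beds   = c(4, 4, 3),
      stringsAsFactors = FALSE
    )

    trust <- c(county = 1, mls = 2, user = 3)   # lower = more trusted

    resolved <- facts %>%
      group_by(zpid, beds) %>%
      summarise(votes = n(),
                best_trust = min(trust[source]),
                .groups = "drop") %>%
      group_by(zpid) %>%
      arrange(desc(votes), best_trust, .by_group = TRUE) %>%
      slice(1)   # beds = 4 wins: two sources against one

    resolved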

Zillow uses machine learning algorithms to look at those internal representations, clean them up, and make sophisticated inferences about which facts are outliers and which might be duplicates. The high quality of the result presents some interesting B-to-B possibilities for monetization, but for now, says Humphries, Zillow is “better served by freely distributing all of our data to whoever wants it versus charging folks.” The company even offers free access to a home valuation API. 
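
The production system presumably uses learned models, but the core idea of outlier screening can be shown with a simple robust statistic: flag any reported value that sits far from the consensus of the other reports. This is a stand-in of my own, not Zillow’s method.

    # Flag reported values far from the robust center of all reports.
    flag_outliers <- function(values, k = 3) {
      center <- median(values)
      spread <- mad(values)   # median absolute deviation, outlier-resistant
      if (spread == 0) return(rep(FALSE, length(values)))
      abs(values - center) / spread > k
    }

    reported_sqft <- c(1840, 1850, 1860, 3200)   # one source is way off
    flag_outliers(reported_sqft)
    # FALSE FALSE FALSE TRUE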

Beyond monetization, though, Zillow seems mainly focused on delivering a positive user experience. Going forward, Thind says, the company wants to keep improving its home value indices, Rent Zestimates, and other home valuation models. Web clickstream data is analyzed regularly to help UI designers optimize usability, and Zillow plans to add new personalization features for both buyers and sellers.

To me, Zillow is a prime example of the effort required to address one vertical slice of big data and produce a quality result. Not only is the Web far from easily machine readable, but a great deal of accessible data is uneven in quality, and much remains stuck in nondigital form. The “digital transformation” we keep hearing about will arrive only when every domain gets the kind of intensive treatment Zillow has applied to U.S. homes.
