Q&A Scalyr’s Steve Newman: Faster Log Queries, Scalable App Monitoring

Aug 14th, 2018 11:30am by Joab Jackson

You may not know Steve Newman’s name, but chances are you’ve come into contact with his work. He was one of the brains behind Google Docs, the base technology of which Google obtained in its purchase of Newman’s Writely. After settling the technology into Google, he left to co-found Scalyr to tackle a new challenge: bringing Google-like scalability to log management. Scalyr’s distributed column-based NoSQL database has brought faster operational visibility to NBC Universal, CareerBuilder, OKCupid and Scripps.

We spoke with Newman by phone to learn about the history of Writely, the founding of Scalyr, and to get his thoughts on DevOps.

Why did you start Writely?

I started it with a couple of friends, one of whom I’d been working with on pretty much everything we had done since college. We started about half a dozen companies together, including Writely. What we were interested in was better ways to collaborate on a document. We used to call it the ‘emailing-a-Word-file’ problem. You’re working on something and you keep emailing an attachment back and forth to your collaborator and losing track of who has which revision.

We had made other attempts at that problem over the years. But basically, one day Sam [Schillace] just came into the office and said, “I’ve got an idea.” And he pretty much sketched Writely. And he had been inspired by Gmail. When we think about Gmail, everyone thinks about gigabytes for free and all the scale and power of the Google infrastructure that’s supporting that. But another thing that was novel about Gmail when it launched was that it was really the first popular webmail platform that had formatting, that had anything more than plain text in it.

And so, hey, if they can do formatting on the web, we could do formatting on the web. And so, that was where the idea came from. We could build this basic document editor that would let people collaborate online. We built a prototype and we started showing it to people. And we’d talk about collaboration and everyone would just look at it and say, “Collaboration. Well, so that’s like Word? Oh, but it’s a webpage? Oh, Word on the web.” And we would frantically backpedal on that. Because we knew Word had a thousand times more features than Google Docs has even today.

But it was interesting because people really grabbed onto [the aspect] of just using it from anywhere. And the collaboration aspect was almost secondary.

I seem to recall there were other attempts to do this at the time. Was there one particular technology or thing that Writely did that, say, these other contenders kind of missed?

It was one of those ideas that suddenly seemed to be in the water.

First, we hit a sweet spot of simplicity. We made it an absolutely minimal process to sign in, go to the site and get started. I think it was literally about, you type Writely.com into your browser and maybe one more click. Or maybe you actually went directly with zero clicks into a document.

To me, it was very simple to get started and very simple to share. And, frankly, it also just had a basic level of professionalism and polish that was missing in some of the other attempts. And it was really limited at the time what you could do in the browser. But, so I think we did a good job. Everyone assumed that it was this huge effort at JavaScript programming, and actually almost everything that was going on at Writely that was interesting was happening on the server because the browsers were still a terrible programming platform. We kept all the heavy lifting in the server, which meant the sharing and the autosave and the revision history and all that.

There was another solution that came out around the same time that was based on a Java applet. And it was massively more powerful than what we had done. It had much more formatting and document layout capabilities, and so forth. But it took like a minute literally to load. And that was just a non-starter. And there was another, from 37Signals [now Basecamp]. So that outfit was a very well respected organization. Around the time we were getting ready to launch Writely, it was about to launch something called Writeboard. And we panicked because we thought, “Oh, here comes, 37Signals is gonna do something amazing and everyone knows about them and we’re done.” Writeboard was very polished, as you’d expect, but it was based on Markdown. And so, it was just serving a different audience.

What was going to Google like after launching a start-up?

Yeah, that was an intense experience. Around the time the acquisition was closing, we were having a planning meeting with the hardware and network provisioning team at Google and it was a little bit of a confused conversation because we were just used to working at such different scales. Writely, at the time of acquisition, was running on four servers. And so, in Google terms, we needed zero racks, zero gigabytes of uplink. Like, we weren’t even a rounding error on any unit they were used to working with.

Well, of course, that changed very quickly after it relaunched as Google Docs. So, there was a very intense three-month period after the acquisition where we just totally rewrote the entire system to move onto Google infrastructure. So, prior to the acquisition, this was C# code running in IIS on Windows servers in a data center somewhere in Dallas, with our own little homegrown object store storing different documents on the local disk on the server. And then three months later, it was Java code running in Google’s servlet container on Linux in Google data centers that stored documents on Bigtable. Every layer was completely redone.

So, how did Scalyr come about?

I trace it back to the experience at Google. Google was an amazing learning experience, certainly for me and I think everyone on the team. There’s no company that’s better at scale and thinking big. One of the things about Google is they’re good at scale and they’re not afraid of it and they’re not afraid of the technical challenge. If there’s a product idea that makes sense, if there’s a need that needs to be filled and it happens to be difficult to fill it, they’re not gonna shy away from it.

I wound up working at Google on some big database infrastructure projects that were supporting, among other things, Google Docs. And it was very complex operationally and there were lots of problems that would come up. Performance problems. One of the teams we work with has a huge traffic spike or has some outage, or something goes wrong. You’ve got all these moving parts, lots of places for problems to arise, and then when a problem arises, you don’t necessarily know where to look. And the relationships are so complex that it’s hard to narrow down what’s cause and what’s effect. And the tools that we had available within Google weren’t that great at sorting through the mass of data. We were collecting all this data, log data, and metrics, and other kinds of data.

We were using, and we counted up at one point, literally 17 different tools, just to sort through all that data. And they each solved a different problem. We needed all 17, and it kind of still wasn’t enough. And so, the idea for Scalyr was to give people a better way to solve the problem we had had, to provide better tools for taking this huge mass of logs and traces and metrics and all your other data coming from 1,000 different places, and actually be able to track down a problem in a reasonable amount of time. So, that was sort of the high-level motivation.

Distributed tracing is good at giving you an overall picture of the performance of your system. And, maybe something is taking a long time, it can tell you, “Well, the reason it’s taking a long time is your database is slow to respond to it.” Why is the database slow to respond? Well, that might turn out to be that you’ve got a traffic spike and you’re overloading the database. Which may not be very obvious. It may be that instead of sending 1,000 queries per second, you’re now sending 1,010 queries per second. But what it turns out is those 10 extra queries are very expensive.

You’re getting the same number of people running their shopping carts but meanwhile, someone over in the business group has been running these big expensive analytics queries. It’s only one query at a time, but it’s completely tying up the database. These problems can be very complex and very subtle and the tracing gives you only one level of the picture.
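To make that arithmetic concrete, here is a minimal sketch in Python. The per-query costs (2 ms for an ordinary query, 500 ms for an expensive analytics query) are made-up numbers chosen for illustration; they do not come from the interview or from Scalyr.

```python
# Hypothetical per-query database costs, in milliseconds (illustrative only).
CHEAP_MS = 2
EXPENSIVE_MS = 500

def db_busy_ms(cheap_qps: int, expensive_qps: int) -> int:
    """Total database work (ms) generated by one second of traffic."""
    return cheap_qps * CHEAP_MS + expensive_qps * EXPENSIVE_MS

before = db_busy_ms(1000, 0)   # 1,000 queries per second, all ordinary
after = db_busy_ms(1000, 10)   # 1,010 queries per second, 10 of them expensive

qps_growth = (1010 - 1000) / 1000         # fraction of extra queries
load_growth = (after - before) / before   # fraction of extra database work

print(f"queries: +{qps_growth:.0%}, database work: +{load_growth:.0%}")
```

Under these assumed costs, a query-rate graph barely moves while the work the database has to do more than triples, which is why a request count alone can hide this kind of problem.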

So how does Scalyr help?

There are really two reasons it helps. One is we’ve built a very streamlined NoSQL database instead of what other log management solutions are using, which is invariably a keyword index.

With keywords, you think of logs, you think of search. That’s how everything from Google Web Search on down does search. The keyword index technology was originally developed for document retrieval, but it turns out to not be a very good fit for this kind of machine data, server logs, application logs.

Here, you’re looking for individual log messages, which are tiny. There are many more of them and they’re much smaller than web pages. They turn over very quickly. They’re full of record IDs and IP addresses and all kinds of gobbledygook that aren’t words. So instead of thousands of unique vocabulary entries, you have millions or billions of vocabulary entries. And there’s never the top 10. There’s no “show me the best 10 times my server crashed.”

Whatever you’re looking for, you wanna find every time it happened and summarize that in a graph or some other kind of visualization. So, all of the indexing technology, like Elasticsearch, doesn’t really help in this problem space. And so, we basically threw all that out and said, “If we wanna search the logs, let’s just search the logs. Let’s just literally scan through the data byte by byte.”
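The idea of scanning instead of indexing can be sketched in a few lines of Python. This is only an illustration of the general approach described above, not Scalyr’s implementation; the log format, the sample lines and the `scan` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical log lines: "<HH:MM:SS> <message>" (format assumed for illustration).
LOG = b"""\
12:00:01 GET /cart 200 req=a81f
12:00:07 GET /cart 500 req=77c2
12:01:15 POST /checkout 500 req=0b9e
12:01:48 GET /cart 200 req=e410
12:03:02 GET /cart 500 req=5d33
"""

def scan(data: bytes, needle: bytes) -> Counter:
    """Scan every line for `needle` and count the matches per minute."""
    per_minute = Counter()
    for line in data.splitlines():
        if needle in line:                      # plain substring scan, no index
            per_minute[line[:5].decode()] += 1  # bucket by "HH:MM"
    return per_minute

hits = scan(LOG, b" 500 ")
for minute, count in sorted(hits.items()):
    print(minute, "#" * count)                  # crude text "graph" of matches
```

Every match is counted and summarized rather than ranked, and nothing has to be tokenized ahead of time, so strings such as request IDs cost nothing until someone actually searches for them.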

And the other thing that we’ve done is moved toward hosted managed services, where there’s an opportunity for economies of scale. But you have to architect the system from the beginning to take advantage of that. So, if you look at other log management solutions, maybe they’re a hosted solution, but the hosted solutions tend to dedicate hardware per customer. So, you only get your little share of horsepower to analyze your logs with.

We’ve architected it more like Google Web Search. We’re not quite at the Google scale yet, but we still have what in log management is an extremely large cluster. And each of our customers gets to use that entire cluster.
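A back-of-the-envelope sketch in Python shows why that matters for a scan-based design. All of the numbers here (1 TB of logs, 1 GB per second per node, 4 dedicated nodes versus a 1,000-node shared cluster) are hypothetical, and the model assumes perfect parallelism with no contention from other customers.

```python
def scan_seconds(log_bytes: int, nodes: int, bytes_per_sec_per_node: int) -> float:
    """Time to scan `log_bytes` when the work is spread evenly across `nodes`."""
    return log_bytes / (nodes * bytes_per_sec_per_node)

GB = 10**9
logs = 1000 * GB     # hypothetical: 1 TB of one customer's logs
per_node = 1 * GB    # hypothetical: each node scans 1 GB per second

dedicated = scan_seconds(logs, nodes=4, bytes_per_sec_per_node=per_node)
shared = scan_seconds(logs, nodes=1000, bytes_per_sec_per_node=per_node)

print(f"dedicated: {dedicated:.0f}s, shared: {shared:.0f}s")
```

Real systems pay coordination overhead and share the cluster among many concurrent queries, so the gap would be smaller in practice; the sketch only shows the direction of the effect.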

So, what is the setup like for the end user? The organization? It sounds like you’re just installing an agent. Is this with the software? Is it inside the software? Is it a sidecar? What’s up with that?

Yeah, so, we support a bunch of different models because we want to meet customers where they are. So it depends, are you running your own servers or at least virtual servers? Or are you in Kubernetes or serverless? Or, there’s a lot of different models people may have. You may already have some kind of logging centralization infrastructure in place.

We have a lightweight agent you can just drop in. It comes packaged for whatever you need it for: a straight server or container installation, or for Docker or Kubernetes. And then we have a lot of API integration. So if you’re running serverless, or Amazon or some other cloud provider is running things for you, running the database or the system or whatever else, the APIs can just reach out on your behalf and collect those logs. So, we can meet you wherever you’re at.

So, it’s really easy to go in and ask a question you’ve never asked before, if you’re troubleshooting a novel problem. You don’t really know where to look, so you’re exploring a lot of, “Is it this? Is it this? Is it this? Is it this? Oh, maybe it has something to do with that. Let me scroll down.” So you’re just firing a lot of scattershot queries and rummaging around until you figure out what’s going on. That’s where we, I think, especially shine.

Maybe the thing I’m most gratified by in kind of how we’ve seen the reception of the product is: We’ll go into an organization and some of our larger customers may have something like 1,000 people who have been given Scalyr logins. And half of them are in there every week and a quarter of them every day, which is a huge engagement for a tool like this.

When people think about Google they think about scale and incredibly intelligent search. But one of the other things that Google really changed the game on in the early days was speed. Earlier search engines were a lot slower. But when you make something faster, people use it a lot more. And I think that’s what we’re bringing here.

What are your thoughts about DevOps going forward? What are the things that we should be looking at when it comes to reporting out in DevOps these days?

DevOps provides both opportunities and challenges. We were talking yesterday with one of our customers, and they’re just starting their move into serverless, but they already have 750 functions deployed on Lambda. And, even if you’re used to running microservices, you’re not used to 750 of anything. How do you trace through that? How do you find a root cause?

I think this is one of those transitions that’s gonna take years and years to play out. People moving some of their code into Lambda is going to be the first of about 20 steps. And the reason is, it’s moving to a model where a provider like Amazon, with real experts, and with the scale to really put a lot of work into a problem, can start providing a lot more value.

So it’s not just sort of hosting code and auto-scaling the code, but the kind of things that people really struggle with that pertain to reliability of services: things like rate limiting, protecting against denial of service, and A/B testing, both for user behavior and also just as a way of detecting bugs or problems with a service. When you decouple what you’re doing a little bit and give the cloud provider the ability to get in there and control and therefore add value at a much more granular layer in what you’re doing, I think the ramifications of that are going to take years to play out.

Google is a sponsor of The New Stack.

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 25 years, including stints at IDG and Government Computer News. Before that, he...