Jacques Mattheij

Technology, Coding and Business

1,000,000 Websites

Over the last couple of weeks I’ve run some analysis on the one million most popular sites on the web.

External resources, risky business

What started off as a simple question (how many sites use externally hosted resources) turned into an ever-expanding project to answer questions about websites engaging in risky behavior. The reason this interested me is that I believe that every externally hosted javascript is essentially a huge gaping hole in a website. There are many ways in which externally hosted scripts and other resources can be used against your users. Some of the more obvious cases are:

  • The hosting site is compromised (by a hacker, or a former or current employee) and the scripts are replaced by something malicious.
  • The company hosting the script goes out of business and a less well intentioned entity takes control of the domain.
  • The code is changed on the fly and it breaks your site.
  • The request for the code contains a referring url which tells the entity hosting the script who is visiting your pages and which pages they are visiting (this goes for *all* externally hosted content, such as fonts and images, not just javascript).

The damage such a script can do to your users, your website and your reputation is extensive. User data, cookies and sessions can be captured or hijacked, giving the controllers of the script access to user identities and control over their accounts, as well as the ability to attempt to install malware on your visitors’ machines (for instance: to steal access credentials for online banking).

The Results

Here are the most interesting results from the analysis:

Of all the domains checked 70%(!) had some form of external content (content not hosted by the same party hosting the original website), and 66% of the total contained externally hosted javascript. You’d think that companies such as banks, hospitals and even governments would be careful with their users’ data but that does not seem to be the case at all.

An important note with respect to ‘what is external content’: if the page referenced embedded resources that did not come from the same domain or a subdomain of that domain then those counted as external content. In other words, if you accessed such a resource directly your URL bar would show an entirely different domain. This affects the statistics because some websites serve content from a second domain, not related to the original in any obvious way, to speed things up for their users. How big a fraction of the sites is affected by this I do not know, but especially for the larger websites this is common practice; facebook, google and yahoo all do this, for instance. (Updated because of lots of comments on HN pointing this out.)
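To make the ‘same domain or a subdomain’ rule concrete, here is a minimal sketch in Python. It is not the code used for the analysis; the function names are made up for illustration, and stripping a leading `www.` to find the site's own domain is an assumption that ignores complications such as multi-part public suffixes (`co.uk` and friends).

```python
from urllib.parse import urljoin, urlparse


def site_domain(page_url: str) -> str:
    """Host of the page, minus a leading 'www.' (a simplifying assumption)."""
    host = (urlparse(page_url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_external(page_url: str, resource_url: str) -> bool:
    """True if the resource is served from outside the page's domain."""
    domain = site_domain(page_url)
    # Resolve relative and protocol-relative references against the page.
    absolute = urljoin(page_url, resource_url)
    host = (urlparse(absolute).hostname or "").lower()
    if not host:
        # data: URIs and similar references involve no third-party host.
        return False
    # Internal means the same domain or any subdomain of it.
    return not (host == domain or host.endswith("." + domain))
```

Under this rule a site that serves its static files from an unrelated second domain gets counted as using external content, which is exactly the caveat described above.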

Jquery, by far the most commonly included library, was present on 55% of the sites surveyed, and this probably makes the jquery webservers some of the hottest targets to hack. 50% of the domains contained advertising of some form. Google content was embedded on 58% of the pages and facebook content on 26%. This gives those companies an excellent angle on expanding their profiles of the visitors of those websites. After all, even if these are not google properties there is absolutely nothing to stop google or facebook from adding an entry to their user profiles to record the visit.

Flash seems to be very rapidly on the way out: less than 1% of the homepages of the domains I looked at still contained flash content. Keep in mind that these are the larger websites, so presumably they have more budget to keep up with the times. On less well maintained sites there is likely still more flash, percentage-wise. Also, these are only the homepages: there is no guarantee that the rest of the site won’t contain a large amount of flash, and the server might have sent a different page if the browser had indicated that it supports flash (which ‘phantomjs’, the headless browser all this work was done with, does not).

The practical upshot of this is that on roughly two-thirds of the web you are not just talking to the site whose URL you see in the browser but also to any number of other parties who have - in principle - absolutely no business knowing about your visit and who, if their servers are compromised, have the ability to distribute malware or other nasty stuff to vast numbers of people.

The Cure

If you have to use externally hosted resources such as javascript libraries then at a minimum you should verify regularly that the code has not changed (and you have to hope that you are looking at the same code that your users see). Before initial deployment you should also verify thoroughly that the code does only what it is supposed to do; this goes for locally hosted code as well. Pick the largest and most trustworthy party you can find to do the hosting, to reduce the risk of the domain being hijacked or a replacement being snuck in unnoticed.
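The regular ‘has the code changed’ check could look something like the sketch below: record a SHA-256 digest of the script when you review it, then periodically fetch the script again and compare. The function names are my own invention for this example, and it has the limitation already mentioned: it only sees the copy served to your checker, not necessarily the one served to your users.

```python
import hashlib
from urllib.request import urlopen


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the script body."""
    return hashlib.sha256(data).hexdigest()


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the externally hosted script as raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def script_unchanged(data: bytes, pinned_digest: str) -> bool:
    """Compare the current script against the digest recorded at review time."""
    return sha256_hex(data) == pinned_digest.strip().lower()
```

Run from a cron job, something like `script_unchanged(fetch(url), pinned)` returning False would be your cue to look at what changed before your users find out the hard way.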

Avoid including resources from smaller companies: copy them to your own server if the license allows it, or find an alternative on a host with a good reputation.

By far the safest approach for website owners that care about their users and their users’ privacy is to simply not include anything at all from other people’s servers. The downside of this is that your users will download a few extra copies of some of the more common libraries but that’s a very small price to pay for a significant reduction in risk. Google analytics junkies in particular will have to weigh whether their users’ privacy is more important to them than their ability to analyze their users’ movements on the site. Especially the number of adult sites and anonymous tip sites (for things such as child abuse, crime reporting, bullying reporting) and other sites in a similar vein that contain google analytics tags is food for thought.

As a user the cure is a lot harder to enforce, but installing a plug-in such as Ghostery certainly won’t hurt. It won’t block all third party javascript but it will at least reduce your exposure a bit (on top of making webpages faster to load). (Funny, while verifying the links in the post I noticed that ghostery blocks google and marketo on their own site…)

In the slightly longer term there will be something called ‘Subresource Integrity’ hashes, which should take care of at least the risk of code being changed out from under the website embedding it (thanks to HN users mangeletti and cbr).
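For the curious, an integrity hash of this kind is just a digest of the file, base64 encoded and prefixed with the algorithm name, placed in an `integrity` attribute on the tag. The sketch below follows the Subresource Integrity specification as published by the W3C; the helper names are hypothetical and the script URL in the usage note is a placeholder.

```python
import base64
import hashlib
import html

ALLOWED = ("sha256", "sha384", "sha512")


def sri_value(data: bytes, algorithm: str = "sha384") -> str:
    """Build an integrity value of the form '<algorithm>-<base64 digest>'."""
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def script_tag(src: str, data: bytes) -> str:
    """Script tag that a supporting browser rejects if the file differs."""
    return (
        f'<script src="{html.escape(src, quote=True)}" '
        f'integrity="{sri_value(data)}" crossorigin="anonymous"></script>'
    )
```

Calling `script_tag("https://cdn.example/lib.js", body)` with the bytes of the version you reviewed gives you a tag that pins that exact version. It does nothing for the privacy side of the problem: the third party still sees the request.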

(dis)Honorable Mentions

Some things that popped up while doing the work that struck me as worth reporting even though they are not about large numbers of websites.

  • Not-so-anonymous: WeTip.com
    wetip.com allows you to anonymously report crimes. That is anonymous for small values of anonymous: the site does not even use encryption, and the page with the form contains an 'add this' widget and a font from google. So I guess that as long as you don't mind those parties, and everybody that sits on the pipes between you and wetip.com, knowing that you reported a crime you should be fine. And if you end up not being fine then you have another crime to report (maybe use the phone in that case?). This is not the only anonymous tip site that made these (and other) mistakes but it serves as a good example.
  • We-like-requests: Dailymail.co.uk
    The homepage of the DailyMail (that paragon of journalistic virtue) makes a whopping 720 requests and takes over a minute to load completely on my broadband connection.
  • Some major brand websites are using evercookies.
    I'm not quite ready to name them because it may very well be that the included bits and pieces are from an advertising company with questionable ethics, but this is really very bad. More to follow.

Many thanks to the creators of phantomjs, who made the programming behind this project much simpler than it would otherwise have been.