Rel-canonical should be handled with care

Something we've been telling clients for years is to not publish the same information in more than one place. There are many reasons for this from the point of view of web semantics, but the one that makes the clients listen is when we say that Google will penalise their site for it.

As of today Google allow duplicate content as long as you indicate clearly which version is the canonical one. This entails adding something like the following to the HEAD element in your duplicated page, pointing back to the original:

<link rel="canonical" href="/the-other-page" />

This approach has been welcomed by many, but I'm fearful that it is duplicating already-existing web semantics as well as encouraging bad habits in web authors.

Redundancy, or why it's not needed

HTTP and HTML already have two mechanisms for dealing with URLs that contain duplicate content:

Permanent redirects

If URLs are exactly equivalent, the server can respond with a 301 (Moved Permanently) status and a Location header pointing at the 'real' content. This is useful in situations where there's no real reason to present different versions of the same page.

Content-location headers

HTTP allows the server to return a content-location header to indicate that the content of the current URL is a duplicate of that at another location. RFC2616 says:

The Content-Location value is not a replacement for the original requested URI; it is only a statement of the location of the resource corresponding to this particular entity at the time of the request. Future requests MAY specify the Content-Location URI as the request-URI if the desire is to identify the source of that particular entity.

... which would seem to duplicate the functionality of rel-canonical with an existing HTTP mechanism. If authors didn't have access to the HTTP headers their server sent, they could embed it directly in their HTML using the META element:

<meta http-equiv="content-location" content="/the-other-page" />

I'm therefore not sure why Google have chosen to re-invent the wheel in this way; my only guess would be that HTML5 is largely deprecating http-equiv in META tags. My main concern is that rel-canonical will start to be used in places where a permanent redirect would be more appropriate.

Google don't follow any public process for recommendations like this. Presumably they have some internal debates, but a lot of the time these things hit the web fully formed and are pushed forwards by Google's massive market share until other search engines support them. It's a shame that they don't publish proposals before implementation, though, as I'm sure a lot of interested groups would have had some strong opinions to share.

Where rel-canonical should not be used

To fix badly configured webservers

Poorly configured webservers, or rather servers configured without much thought towards URL semantics, appear to be the major source of duplicated content. Since there's no reasoned plan behind these duplications, it's far better for the developer to reconfigure the server properly than to start adding markup to their documents.

Some misconfiguration examples include:

  • Site available under www.example.com and example.com. This is rarely a carefully thought out decision; rather, it's often just the server's default behaviour. The best bet is to pick one hostname and permanently redirect the other to it. (It's worth having a look at no-www.org while you're making the decision.) I can't think of any strong reason for having the site available (as in, being served under rather than redirected from) on both hostnames - the site should be served on whatever hostname you use on your headed notepaper.
  • /foo and /foo/ and /foo/index.html all point to the same page. Again this is just default behaviour, based on a hierarchical model of websites as folders full of files. This is easily fixed with a bit of config to redirect all of them to /foo, but even on a simple site it's easy enough to just standardise on leaving index.html out of hyperlinks, so that the duplicates are never referred to.

In both of these cases, I'd estimate the amount of effort involved in 'fixing' the config to be less than that of each page detecting what URL it's served on and generating an appropriate rel-canonical link.
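As a rough illustration of how little logic the redirect approach needs, here is a minimal Python sketch of the two rules above. The function name redirect_target and the choice of the bare example.com as the preferred hostname are assumptions made for the example; in practice this would more likely live in the webserver's own configuration than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: the bare domain is the preferred hostname.
CANONICAL_HOST = "example.com"


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if `url` is already canonical."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path

    # Fold the www. variant into the preferred hostname.
    if host == "www." + CANONICAL_HOST:
        host = CANONICAL_HOST

    # /foo/index.html collapses to /foo/ ...
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # ... and /foo/ collapses to /foo (the site root "/" is left alone).
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"

    target = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
    return None if target == url else target
```

With this sketch, http://www.example.com/foo/index.html and http://example.com/foo/ would both be redirected to http://example.com/foo, while a request for http://example.com/foo itself returns None and is served normally.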

To allow chaff in URLs

I previously wrote about how when a form is submitted the querystring is full of extraneous information, and how to make the resultant URLs more canonical using redirects.

It would be tempting to do something similar with rel-canonical, but a permanent redirect is the appropriate tool when the two URLs represent the same resource. Take /search?size=10&colour=&shape= and /search?size=10: both requests mean 'items that are size 10'. Therefore it's entirely appropriate to redirect the former to the latter.
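A minimal Python sketch of that clean-up, assuming the only 'chaff' is parameters submitted with empty values (the function name strip_empty_params is made up for this example):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_empty_params(url):
    """Return `url` with any empty-valued query parameters removed."""
    parts = urlsplit(url)
    # keep_blank_values=True so the empty parameters are visible and the
    # filtering below is explicit.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if value != ""]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

A request handler could compare the incoming URL with the cleaned one and issue a 301 to the cleaned version whenever they differ; for /search?size=10&colour=&shape= the cleaned URL is /search?size=10.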

Another case worth looking at is when the results of a search are the same as the results of another. An example might be if the URLs /search?size=10 and /search?colour=red returned the exact same results. This may happen, for instance, if all the red items were only available in size 10.

In this case the two URLs point at clearly different resources ('list of things that are size 10' vs 'list of things that are red'), and the equivalence may change over time. Therefore it's not something we'd want to do a redirect between - is it an appropriate place for rel-canonical?

Google's definition is a bit quiet on the details of what the exact semantics of rel-canonical are, but more importantly for these kinds of resources it'd be completely impractical to generate a list of which other resources might happen to contain the same list of results at the given time.

Where rel-canonical should be used

After all this griping, I can still see some cases where this new semantic might be useful, if we're going to be abandoning the Content-Location HTTP header (or rather, I can think of a few places where content-location would be appropriate that will presumably transfer over).

What these cases have in common is that we don't want to combine the resources, either because we want to slightly vary the way they're presented or because we think that, even though they're equivalent at the moment, they might stop being so in the future.

Presenting resources at different points in a site hierarchy

A common sort of URL online is /red-widgets/widget2000, which contains an identifier for the resource being viewed as well as some indication of where it sits in the site's hierarchy.

Really this sort of slash-separated construction doesn't carry any special meaning. It's largely the same as ?category=red-widgets&product=widget2000, except that it contains an implicit hierarchy that hints that you can't have a product identifier without specifying a category.

In a typical application, of course, the page served would contain lots of different contextual stuff such as links to other items in the same category, and maybe some breadcrumbs. A product may exist in a number of different categories, so we would want to hint that the different URLs were somehow equivalent, without redirecting between them.

Enter rel-canonical: this seems like an entirely appropriate usage to me. From a resource point of view the page is something like 'widget2000's details in the red-widgets category', which is distinct from 'widget2000's details in the big-widgets category', but their contents are the same and we can hint that using rel-canonical. Which version is the canonical one I would leave as an exercise to the reader.
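For illustration only, here is one way a page template might emit the hint, sketched in Python. The PRODUCT_CATEGORIES lookup, the canonical_link function and the rule that the first category listed is the canonical one are all assumptions invented for this example; any other rule for picking the canonical version would work just as well.

```python
from html import escape

# Hypothetical catalogue: product id -> the categories it is listed under.
# Assumption: the first category in each list is treated as canonical.
PRODUCT_CATEGORIES = {
    "widget2000": ["red-widgets", "big-widgets"],
}


def canonical_link(category, product):
    """Return a rel-canonical LINK tag for a product page.

    Returns an empty string when the page being rendered is already the
    canonical one, since it has nothing to point at.
    """
    primary = PRODUCT_CATEGORIES[product][0]
    if category == primary:
        return ""
    href = "/" + primary + "/" + product
    return '<link rel="canonical" href="%s" />' % escape(href, quote=True)
```

Under these assumptions, rendering /big-widgets/widget2000 would add a link pointing at /red-widgets/widget2000 to the HEAD, and /red-widgets/widget2000 itself would add nothing.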

Keeping temporarily-equivalent URLs separate

This example is from Google's own pages, but is quite a good one. They quote Wiki type pages, where one wiki entry 'redirects' to another. The way this is normally handled in Wikis is to not actually do a redirect, but present the same content with a note saying it's a redirect.

Why do this? Well, the reasoning seems to be that the resources might at some point stop being equivalent, so it's a good idea to keep them separated out. This would seem to be a reasonable case for a temporary redirect or even a permanent redirect with a short(er) cache header, but in practical terms if the user bookmarks the URL or sends it to a friend, you want them to still have the 'original' URL in their address bar.

This again does seem a reasonable use of rel-canonical, though one I'm less comfortable with.

Future progress

I have a few ideas I'd like to see considered for moving this technology forwards:

  1. Adoption of rel-canonical on A elements as well as on LINK elements in the HEAD. A lot of the time (e.g. in the Wiki example) a valid link to the alternate will already exist on the page. If the @rel could be added to this link rather than having to exist in the head, it would save a bit of effort and keep the markup local to the link.
  2. Use of this in Microformats, specifically in the field of finding canonical hCards and so on. There are already efforts to do similar things, but if this semantic becomes widespread it's worth adopting.
  3. Really, a bit more definition of what this @rel value is trying to indicate in terms of REST semantics. The wording from Google has so far been fairly focussed on what it means practically for search indexing, but it'd be good to see them look at it from a more academic point of view.

Comments

1.

I'd have to point out that Google weren't acting unilaterally (not this time around at least), rather this initiative was announced by Google, Yahoo and Microsoft on the same day. See also:

http://ysearchblog.com/2009/02/12/fighting-duplication-adding-more-arrows-to-your-quiver/
http://blogs.msdn.com/webmaster/archive/2009/02/12/partnering-to-help-solve-duplicate-content-issues.aspx

The idea of (for example) search engine companies adapting HTML for their own purposes does leave me feeling uneasy, all the same.

That said, I can't say I prefer either of your methods. A 301 means "moved permanently", and in the case of so-called canonical URLs, nothing has moved. The Content-Location header also seems inappropriate, but then my reading of the paragraph from RFC2616 seems to be the complete opposite of yours! I interpret the semantic as being "here's this other location, I got the relevant stuff from there just now, but it might move" which is far from what canonical URLs are trying to achieve.

The Content-Location header seems to be more appropriate for implementation of content negotiation, as in this paper:

http://www.w3.org/TR/cooluris/#conneg

I realise that I'm not offering any constructive alternatives, but then I tend also to subscribe to the philosophy that if you need this device in the first place, you're probably doing it wrong.

Simon Harris
16th February 2009, 15:39

2.

Thanks for the comments, Simon.

I indeed hadn't realised Yahoo! and MS had both implemented the same thing, and the Google announcement I linked to didn't seem to mention it either! I do wonder how they agree these things.

I'm also quite interested in your interpretation of content-location. It's one of those little-used headers, so maybe I need to re-read it and do some more research on what exactly it means; it's entirely possible you're right.

However, I disagree on a couple of points:

1. The 301 status may be defined as 'Moved Permanently' but there's some wiggle room in that.

In the RFC it says the resource has been moved. I would argue that clearly there was a resource there, because somebody has been able to construct a URL for it. I don't think that the existence of a past specific representation of that resource is 100% necessary, though clearly this is a bit pedantic.

The other argument I'd make is that it's very common to set up a newly-registered domain to 301 redirect to another domain, without first taking time to establish a site there, so common usage of the 301 semantic might have evolved since the RFC was written.

2. I certainly don't think this counts as 'adapting HTML'. HTML4 specifically allows any values you like for @rel, and even encourages you to develop your own semantics. However, if you start defining your own semantics HTML4 says you SHOULD add a @profile to your <HTML>, which this proposal doesn't suggest. It's a slap-on-the-wrists rather than an error though.

Certainly defining new @rel/@rev values to mean specific things is not new, so even if you disagree you may have missed the boat!

Ciaran McNulty
16th February 2009, 15:53

3.

Where this will be useful (unless I'm misunderstanding it) is for shops where you inevitably get a large number of pages with extremely similar content. For example, if you've sorted by price or filtered by colour using text links then the URL might change and the new page would be accessible by Google. You really don't want to confuse Google, so it's a nice way of reminding it which is the master page. Previously this has been handled by nofollow and noindex, but this way seems more elegant - rather than excluding the pages, just communicate the hierarchy to Google.

Ciaron Dunne
20th February 2009, 11:22
