Word separators in URLs

In web development and search engine optimisation this topic is frequently discussed, yet often without any reasoning or conclusion. The purpose of this article is to investigate the reasoning behind the choice of separator.

So, let’s start at the very beginning, and find out what “word separators” actually are, and why we need them in URLs.

Traditionally a word separator is a space: yes, the everyday space you create with your space bar.

The problem with using spaces in URLs is that when the URL is used in a browser (for example), it is percent-encoded, so each space appears as “%20”. The result is an ugly URL that is difficult for a human to read.

eg: http://www.example.com/percent%20encoding
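To see the encoding happen, here is a minimal sketch using Python's standard urllib.parse module. The path "/percent encoding" is simply the example above with its space restored; nothing here is specific to any particular browser.

```python
from urllib.parse import quote, unquote

# The example path from above, written with a real space.
path = "/percent encoding"

# quote() percent-encodes characters that are not allowed in a URL path.
# By default it leaves "/" alone, so only the space is changed.
encoded = quote(path)
url = "http://www.example.com" + encoded
print(url)

# unquote() reverses the encoding and gives the space back.
decoded = unquote(encoded)
print(decoded)
```

The encoded form is what ends up in the address bar and in links, which is why authors look for a separator that needs no encoding at all.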

How do we overcome the problem? Over the years a workaround has developed…

…the dash, no the hyphen, no in fact it’s the minus sign (yes, I mean this “-” symbol)…

eg: http://www.example.com/not-percent-encoding

The history

It appears that the usage of the hyphen originally comes from its appearance in hostnames and domain names.

In the early stages of the internet, groups of developers decided to write RFC documents to introduce ideas of how they believed the internet should work. These RFCs quickly became the standard, and they include details such as what is and what isn’t a valid hostname or domain name.

We can clearly see the use of the “minus sign” as a separator in RFC 952, although originally it wasn’t there to be used in place of spaces, but simply to allow you to add suffixes to hostnames.

Later, the “hyphen” appears in RFC 1738, where a standard for URLs is suggested.

Beyond that “hyphens” are mentioned again in RFC 3986 entitled “Uniform Resource Identifier (URI): Generic Syntax”, which clearly defines the URI (and consequently URL) syntax.

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

From this we can see that the “hyphen” is categorised as an “Unreserved Character”. “Reserved Characters”, by contrast, are often used as delimiters and therefore in many programming situations require “escaping”, although they are not that difficult to work with.

Back in 2004 GoogleGuy on WebmasterWorld clearly stated that the period (.), the comma (,) and the hyphen (-) are valid word separators in URLs. He also mentioned that most people seem to prefer the hyphen, without explaining why…
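The grammar above translates almost directly into code. The following is only a sketch: the classify() helper and the three constant names are made up for illustration, and the comparison with quote() assumes Python 3.7 or later, where "~" is treated as unreserved.

```python
import string
from urllib.parse import quote

# Character classes copied from the RFC 3986 rules quoted above.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(char):
    """Return the RFC 3986 class of a single character (hypothetical helper)."""
    if char in UNRESERVED:
        return "unreserved"
    if char in GEN_DELIMS or char in SUB_DELIMS:
        return "reserved"
    return "other"


for char in "-._+, ":
    print(repr(char), classify(char))

# With no characters marked as safe, quote() leaves the unreserved
# separators untouched and percent-encodes the reserved ones.
unreserved_result = quote("-._~", safe="")
reserved_result = quote("+,/", safe="")
print(unreserved_result, reserved_result)
```

Note that the space falls into neither class, which is exactly why it cannot appear in a URL without being encoded.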

Present usage

Today the use of hyphens as a word separator in a URL is so common that people are beginning to think that it is an actual standard. It’s not.

And it’s not hard to see why. The amount of open source blogging software (such as WordPress) and content management software (such as Joomla) out there using them has given people enough reason to never even think twice about it.

But, why is using a hyphen a problem?

As with any workaround, there are always going to be problems.

The problem arises from the confusion between the dash, the hyphen and the minus sign. They are often written with the same symbol, so one character ends up standing for all three, whether it is used as punctuation or to give context to data.

One situation where there is potential confusion is with names and words that already contain hyphens (eg: Jean-Claude Van Damme). Once the spaces are replaced, the hyphen that belongs to the name can no longer be told apart from the hyphens that stand in for spaces.
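The ambiguity is easy to reproduce. The sketch below uses two made-up helpers, slugify() and unslugify(), which are deliberately naive and only stand in for whatever a blogging package really does.

```python
def slugify(title):
    """Lower-case the title and join its words with hyphens (naive sketch)."""
    return "-".join(title.lower().split())


def unslugify(slug):
    """Try to turn a slug back into words by treating every hyphen as a space."""
    return slug.replace("-", " ")


hyphenated = slugify("Jean-Claude Van Damme")
spaced = slugify("Jean Claude Van Damme")

# Both titles collapse into the same slug...
print(hyphenated)
print(spaced)

# ...so the original hyphen cannot be recovered from the URL alone.
recovered = unslugify(hyphenated)
print(recovered)
```

Two different titles produce one URL, and the round trip loses the hyphen in the name.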

So, what are the alternatives?

Underscores (_). In the past we have seen dashes vs. underscores discussed, and the conclusion was that dashes are better than underscores in URLs because of how search engines interpret the input. Until now…

Recently it has been stated that Google now views underscores as word separators, in exactly the same way as dashes. However, a quick test soon shows that this hasn’t really happened, even though the hugely popular Wikipedia and Digg.com use the underscore as a word separator in their URLs.

Plus (+). In common URL encoding, as used for form data and query strings, spaces are often encoded as this symbol instead of the percent-encoded version (%20). This gives you a good basic reason to use this symbol as opposed to any other.

The biggest reason to use it instead is that, unlike the hyphen, it is not found in words in the English dictionary, and it is all but unheard of in names. However, the one problem I have found is that, visually, URLs encoded this way end up looking like search terms rather than static URLs.
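The two ways of encoding a space can be compared side by side with Python's urllib.parse. This is only an illustration: the phrase is arbitrary, and whether a given server treats "+" in a path as a space is an assumption you would need to check for your own setup.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

phrase = "word separators in urls"

# Percent-encoding turns each space into %20.
percent_form = quote(phrase)

# Form-style encoding turns each space into "+".
plus_form = quote_plus(phrase)

print(percent_form)
print(plus_form)

# unquote_plus() maps "+" back to a space, but plain unquote() does not:
# it only decodes %XX sequences, so the "+" survives as a literal plus.
decoded_as_form = unquote_plus(plus_form)
decoded_as_path = unquote(plus_form)
print(decoded_as_form)
print(decoded_as_path)
```

In other words, the plus only means “space” to code that decodes form-style data, which is worth bearing in mind before relying on it in a path.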

Full stop (.). As it’s an unreserved character we can easily use this in our URLs. The problem is that it may start to get confusing, as URLs may look like files with extensions. Also, in terms of programming, the “.” often means “any character”, especially in Perl-style regular expressions, which means you may have to escape it when using it in a pattern. It’s worth noting that “.” generally doesn’t get encoded to %2E.
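Both problems with the full stop show up in a few lines of Python. The slug "word.separators" is invented for the example, and posixpath is used only so the result does not depend on the operating system.

```python
import posixpath
import re

slug = "word.separators"

# Unescaped, "." matches any character, so the pattern is too permissive.
loose = re.compile(slug)

# re.escape() turns "." into "\.", so only a literal full stop matches.
strict = re.compile(re.escape(slug))

loose_hit = loose.fullmatch("word-separators") is not None
strict_hit = strict.fullmatch("word-separators") is not None
literal_hit = strict.fullmatch("word.separators") is not None
print(loose_hit, strict_hit, literal_hit)

# Path-handling code reads everything after the last "." as a file extension.
root, extension = posixpath.splitext("/" + slug)
print(root, extension)
```

The unescaped pattern happily matches the hyphenated slug too, and the last word of the URL is mistaken for an extension.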

The future

What does this mean for my current sites using dashes?

Ultimately nothing.

Any good webmaster knows that once you have your URLs in place, the last thing you want to be doing is abandoning them, unless you rewrite/remap them.

However, next time you’re developing a new site, you will at least know why you chose one word separator over another.