Friday 26 August 2011

URL Rewriting Guide



What is Mod_Rewrite?
Simply put, mod_rewrite is an Apache module that lets you rewrite URLs based on rules you define. That’s it. Seriously.
Regardless of how confusing some of the rules you may have come across appear to be, all they are doing is taking one URL and rewriting it as a different URL. This rewriting happens at the server level, before the user’s browser sees anything, so the end result is seamless to them.

When you hear about “search engine friendly” URLs, you’re most often seeing mod_rewrite in action. Mod_rewrite is the Apache module that lets you turn a URL like, for example:

http://example.com/products.php?category=3&id=7

into:

http://example.com/products/3/7
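A sketch of the rule behind that kind of “pretty URL” mapping might look like this (the script name products.php and the parameter names here are placeholders, just for illustration):

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Internally map the pretty URL onto the real script, e.g.
# /products/3/7  ->  /products.php?category=3&id=7
RewriteRule ^products/([0-9]+)/([0-9]+)$ products.php?category=$1&id=$2 [L,QSA]
</IfModule>
```

The QSA flag appends any existing query string instead of discarding it, and L stops mod_rewrite from processing further rules on this pass.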

Some other common uses for mod_rewrite:
  • Directing all traffic from multiple domain names to one domain
  • Directing all traffic from www and non-www to one location
  • Blocking traffic from specific websites
  • Blocking spammy searchbots and offline browsers from spidering your site and eating your bandwidth
  • Masking file extensions
  • Preventing image hotlinking (other web pages linking to images on your server)
Apache’s mod_rewrite can be intimidating if you start where you’re supposed to start – the Apache documentation. However, there are some very useful, common and simple rewrite rules that you may wish to consider implementing in your site development plan, if you’re not doing so already.

Note: If you’re using Microsoft IIS, you have a few options, but I don’t use IIS, so I’m afraid I won’t be of much help to you beyond telling you where to look. ISAPI ReWrite seems to be very popular, and there is a free “lite” version available.

Getting Started

Your mod_rewrite rules typically live in an .htaccess file in your web root. You can only have one .htaccess per directory, but you can have individual .htaccess files in sub-directories under the web root. I generally do not recommend doing this. If mod_rewrite rules from one .htaccess conflict with the rules from the .htaccess in a sub-directory, it can be a real pain in the ass to troubleshoot. Try to avoid it.
When you’re adding mod_rewrite rules to your .htaccess file, you’ll want to start with a conditional that checks whether mod_rewrite is available on your server. This prevents a 500 Internal Server Error if the module isn’t loaded.


<IfModule mod_rewrite.c>
# Start your (rewrite) engines...
RewriteEngine On

# rules and conditions go here...
</IfModule>


Directing Multiple Domain Names to a Single Domain
If you have multiple domain names pointing to the same site, mod_rewrite can also help you direct all traffic to a single domain, to improve your search engine rankings. Search engines don’t take too kindly to the same content living at multiple URLs – they usually think it’s an attempt to spam the search engine – and you can actually be penalized for it. To redirect all traffic to one specific domain name:


RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
This basically says “if the domain requested (the HTTP_HOST) does not match www.example.com, then rewrite the URL as www.example.com”. (Note the escaping backslashes after the www and before the .com in the condition.) The R=301 specifies that the redirect should be a 301 redirect, meaning that the address has moved permanently and search engines should use the new URL instead of the old one.

To www or not to www

Even if you have only one domain name, if you’re not redirecting traffic from the “www” version to the “non-www” version (or vice versa), you may run into the same duplicate-content problem. Whether or not you choose to use the www in your URL is largely a branding decision more than anything else (i.e. it doesn’t really matter in most cases) – but you should pick one and stick with it.


Require (force) the www


RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Remove the www


RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]


Deny traffic by referrer

There may be a few reasons why you want to block traffic by referrer. Maybe you’re getting a lot of bandwidth-sucking hits from a spammy website – or maybe someone is linking to you in a way you feel does not represent you very well, and you want to pull the plug on traffic coming from their site.

RewriteCond %{HTTP_REFERER} onebadsite\.com [NC,OR]
RewriteCond %{HTTP_REFERER} anotherbadsite\.com [NC]
RewriteRule .* - [F,L]
In this snippet, the rule is saying “If the referring URL contains onebadsite.com OR anotherbadsite.com, return an HTTP 403 Forbidden error.” The NC specifies that the condition is not case-sensitive, and the OR flag is… well… an “or”. OR is used with multiple RewriteCond directives to combine them with OR instead of the implicit AND.
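To make the implicit-AND versus OR distinction concrete, here’s a quick sketch (the site names and file pattern are just placeholders):

```apache
# Implicit AND: the rule fires only if BOTH conditions match --
# the referrer contains onebadsite.com AND the request is for a .jpg.
RewriteCond %{HTTP_REFERER} onebadsite\.com [NC]
RewriteCond %{REQUEST_URI} \.jpg$ [NC]
RewriteRule .* - [F,L]

# Explicit OR: the rule fires if EITHER referrer matches.
RewriteCond %{HTTP_REFERER} onebadsite\.com [NC,OR]
RewriteCond %{HTTP_REFERER} anotherbadsite\.com [NC]
RewriteRule .* - [F,L]
```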

Keep in mind – this method of blocking traffic is hardly foolproof, at least in the latter of the two scenarios above. If the webmaster of onebadsite.com is linking to you in a way or context you do not want (and you’ve asked them to remove the link), the above method will cause a user who clicks that link on onebadsite.com to hit a Forbidden error. If that user has half a brain, they may well just google your site name or try to access it later from a bookmark – but it’s a simple measure you can take to keep the idjits out.


Blocking Bad Bots and Spiders

While there is some potential debate as to what makes a “bad” bot or spider, the consensus seems to be that a bot is bad if it does more harm than good – e-mail harvesters, site rippers that download entire websites for offline browsing, and so on. Even if bandwidth isn’t so much an issue, I like to block these just on principle.

Please note – this list is not mine – it was directly nicked from a list on JavascriptKit.

RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Once again, this method isn’t foolproof. The HTTP_USER_AGENT is quite easily spoofed, and some site-ripping applications even allow you to specify what user agent you want to appear as. But if your site is large, implementing this list may have a significant impact on your monthly bandwidth bill.

Mask File Extensions
If for some reason you want to hide the fact that you’re using PHP (or Perl, or whatever), all it takes is a simple line in your .htaccess to have your .php files look like .html files:

RewriteRule ^(.*)\.html$ $1.php [L]
You could even completely obfuscate it if you wanted to, for example serving files that end in .snipe that are really .php files:

RewriteRule ^(.*)\.snipe$ $1.php [L]
In these examples, any request for a file ending in .html (or .snipe) is silently served from filename.php, so it looks like all your pages are .html (or .snipe) but really they are .php. Notice that these are internal rewrites with no R flag – an external redirect (R=301) here would send the browser to the .php URL and give the game away.


Prevent Image/File Hotlinking
This snippet prevents people from hotlinking to your files – that is, linking directly to files hosted on your server from their website, thus sucking your bandwidth. It should be noted that in my experience, this rewrite rule can be somewhat spotty and doesn’t always work, so be sure to test thoroughly.

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/.*$ [NC]
RewriteCond %{REQUEST_URI} !dont_steal_bandwidth_jackass\.png$ [NC]
RewriteRule \.(gif|jpg|swf|flv|png)$ /images/dont_steal_bandwidth_jackass.png [R=302,L]

This rule basically says “If the request’s referrer is not blank (a blank referrer usually means the file was accessed directly in a browser, which we allow) AND the referrer is not example.com (case insensitive), redirect any request for a file ending in .gif, .jpg, .swf, .flv or .png to /images/dont_steal_bandwidth_jackass.png”. The third condition excludes the replacement image itself, so the rule doesn’t redirect its own .png in an endless loop.


 Thanks 

Sarang Kinjavdekar
