Using Robots.txt to Keep Your Joomla Pages Under Control

The technical side of Joomla SEO can be summed up in one sentence: keep your URLs under control.

Joomla really is a powerful tool for creating content-rich websites, but it’s also easy to end up with a whole lot of useless URLs.

In today’s post, we’ll use MosTree as an example of how to manage Joomla URLs, with the help of the wonderful-sounding robots.txt file.

In recent weeks, I’ve blogged about a few ways to make MosTree more Search Engine Friendly (first update, second update), and we’ve also talked about how having a few high-quality pages on your site is much better than having a lot of low-quality pages. This post is a follow-up to both of those.

A Little Background for this Example

Last year we launched JoomlaYellowPages.com. The site has done well and now lists nearly 200 Joomla companies worldwide. It has also done well in SEO terms: search for Africa Joomla, Asia Joomla, Europe Joomla, or any other geographic region plus Joomla, and there’s a good chance that JoomlaYellowPages.com will be high in the results.

What Was the Problem?

We noticed very early on that Google was also indexing multiple pages for each listing: the contact form, the recommend page, and more. One company = 4 or 5 URLs.

For example:

joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/details/

joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/contact/

joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/review/

joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/claim/

What Was the Solution?

We used the robots.txt file, located in the root of the site, to stop Google from indexing all the extra pages.

Normally, if you have a component producing hundreds of extra URLs, you can simply block the whole component from being indexed:

Disallow: /badcomponentforseo/
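
As a side note, a Disallow rule only takes effect inside a User-agent group, so in practice that one-line fix lives in a robots.txt file that looks something like this (the component folder name is just a placeholder):

User-agent: *
Disallow: /badcomponentforseo/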

However, in this case we needed a scalpel rather than a sledgehammer. We wanted Google to index only certain parts of MosTree and ignore the rest, so we used the wildcard * symbol to block all URLs with a specific beginning and ending, regardless of what was in the middle:

Disallow: /listings/*/*/*/review

Disallow: /listings/*/*/review

Disallow: /listings/*/*/*/Add_Listing

Disallow: /listings/*/*/Add_Listing

Disallow: /listings/*/Add_Listing

Disallow: /listings/*/*/*/Add_Category

Disallow: /listings/*/*/Add_Category

Disallow: /listings/*/Add_Category

Disallow: /listings/*/*/*/contact

Disallow: /listings/*/*/contact

Disallow: /listings/*/*/*/recommend

Disallow: /listings/*/*/recommend
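
One note on the repetition above: Google describes the * wildcard as matching any sequence of characters, including slashes, so in principle a single shallower pattern per page type should cover every depth. The per-depth rules are simply the more cautious, portable choice. If you trust the wildcard handling, a collapsed version would look something like this:

Disallow: /listings/*/review
Disallow: /listings/*/contact

and so on for the recommend, Add_Listing and Add_Category pages.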

How Can I Apply This To My Site?

Regularly check what kinds of pages Google is indexing on your site and look for patterns. If there are a lot of PDF pages, or dozens of useless links from a particular component, you can act quickly to block them with robots.txt. Use a site:mydomain.com search or a tool such as WebCEO.com.
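
For example, if stray PDF versions of pages turn out to be the problem, the same wildcard syntax can target them. Google and Yahoo also recognize a $ anchor that matches the end of a URL, so a rule along these lines (purely an illustration) would keep PDFs out:

User-agent: *
Disallow: /*.pdf$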

One of the most important things you can do is check which of your pages are in Google’s supplemental index. This is where you’ll find many of your low-quality pages, ripe for removal via robots.txt. If the pages don’t contain useful information, dump them.

Read More About Robots.txt 

Originally the wildcard wasn’t supported in robots.txt, but that has since changed: both Google and Yahoo now recognize it.
