Showing posts with label Screaming Frog. Show all posts

Friday, 23 November 2012

Common Technical SEO Problems and How to Solve Them

I love technical SEO (most of the time). However, it can be frustrating to come across the same site problems over and over again. In the years I've been doing SEO, I'm still surprised to see so many different websites suffering from the same issues.

This post outlines some of the most common problems I've encountered when doing site audits, along with some not-so-common ones at the end. Hopefully the solutions will help you when you come across these issues, because chances are that you will at some point!

1. Uppercase vs Lowercase URLs

From my experience, this problem is most common on websites that use .NET. The problem stems from the fact that the server is configured to respond to URLs with uppercase letters and not to redirect or rewrite to the lowercase version.  
I will admit that recently, this problem hasn't been as common as it was because generally, the search engines have gotten much better at choosing the canonical version and ignoring the duplicates. However, I've seen too many instances of search engines not always doing this properly, which means that you should make it explicit and not rely on the search engines to figure it out for themselves.
How to solve:
There is a URL Rewrite module which can help solve this problem on IIS 7 servers. The tool has a nice option within the interface that allows you to enforce lowercase URLs. If you do this, a rule will be added to the web.config file which will solve the problem.
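To make the logic of such a rule concrete, here is a minimal, platform-neutral sketch in Python. It is not the IIS module's own configuration; the function name is made up for this example, and leaving the query string untouched is my assumption (parameter values can be case-sensitive).

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect_target(url):
    """Return the URL to 301 to, or None if the path is already lowercase.

    Only the path is lowercased; the query string is left as-is because
    parameter values can be case-sensitive (an assumption of this sketch).
    """
    parts = urlsplit(url)
    if parts.path == parts.path.lower():
        return None
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path.lower(), parts.query, parts.fragment)
    )


print(lowercase_redirect_target("http://www.example.com/Products/Hiking-Boots?ref=ABC"))
print(lowercase_redirect_target("http://www.example.com/products/hiking-boots"))
```

Running the same function over a crawl export is also a quick way to list every URL that would be affected before the rule goes live.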

2.  Multiple versions of the homepage

Again, this is a problem I've encountered more with .NET websites, but it can happen quite easily on other platforms. If I start a site audit on a site which I know is .NET, I will almost immediately go and check if this page exists:
www.example.com/default.aspx
The verdict? It usually does! This is a duplicate of the homepage that the search engines can usually find via navigation or XML sitemaps.
Other platforms can also generate URLs like this:
www.example.com/index.html
www.example.com/home
I won't get into the minor details of how these pages are generated because the solution is quite simple. Again, modern search engines can deal with this problem, but it is still best practice to remove the issue in the first place and make it clear.
How to solve:
Finding these pages can be a bit tricky as different platforms can generate different URL structures, so the solution can be a bit of a guessing game. Instead, do a crawl of your site, export the crawl into a CSV, filter by the META title column, and search for the homepage title. You'll easily be able to find duplicates of your homepage.
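The filter-by-title step can also be scripted. This is a rough sketch that assumes the crawl export has one column for the URL and one for the page title; the column names "Address" and "Title 1" and the sample rows are assumptions for illustration, so adjust them to whatever your crawler actually exports.

```python
import csv
import io


def find_homepage_duplicates(csv_text, homepage_url,
                             url_col="Address", title_col="Title 1"):
    """Return URLs whose title matches the homepage's title, excluding the homepage."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    titles = {row[url_col]: row[title_col].strip() for row in rows}
    home_title = titles[homepage_url]
    return [url for url, title in titles.items()
            if title == home_title and url != homepage_url]


# Hypothetical crawl export, for illustration only.
sample = """Address,Title 1
http://www.example.com/,Example Store
http://www.example.com/default.aspx,Example Store
http://www.example.com/hiking-boots,Hiking Boots
"""
print(find_homepage_duplicates(sample, "http://www.example.com/"))
```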
I always prefer to solve this problem by adding a 301 redirect to the duplicate version of the page which points to the correct version. You can also solve the issue by using the rel=canonical tag, but I stand by a 301 redirect in most cases.
Another solution is to conduct a site crawl using a tool like Screaming Frog to find internal links pointing to the duplicate page. You can then go in and edit the duplicate pages so they point directly to the correct URL, rather than having internal links going via a 301 and losing a bit of link equity.
Additional tip - you can usually decide if this is actually a problem by looking at the Google cache of each URL. If Google hasn't figured out the duplicate URLs are the same, you will often see different PageRank levels as well as different cache dates.

3. Query parameters added to the end of URLs

This problem tends to come up most often on eCommerce websites that are database driven. There is a chance of it occurring on any site, but the problem tends to be bigger on eCommerce websites as there are often loads of product attributes and filtering options such as colour, size, etc. Here is an example from Go Outdoors (not a client):
In this case, the URLs users click on are relatively friendly in terms of SEO, but quite often you can end up with URLs such as this:
www.example.com/product-category?colour=12
This example would filter the product category by a certain colour. Filtering in this capacity is good for users but may not be great for search, especially if customers do not search for the specific type of product using colour. If this is the case, this URL is not a great landing page to target with certain keywords.
Another possible issue that has a tendency to use up TONS of crawl budget is when said parameters are combined together. To make things worse, sometimes the parameters can be combined in different orders but will return the same content. For example:
www.example.com/product-category?colour=12&size=5
www.example.com/product-category?size=5&colour=12
Both of these URLs would return the same content but because the query strings are different, the pages could be interpreted as duplicate content.
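One way to see how many of your crawled URLs are really the same page is to normalise the parameter order before comparing them. The sketch below simply sorts the parameters alphabetically; the function name is hypothetical, and it assumes that parameter order never changes the content on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalise_query_order(url):
    """Return the URL with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


a = normalise_query_order("http://www.example.com/product-category?colour=12&size=5")
b = normalise_query_order("http://www.example.com/product-category?size=5&colour=12")
print(a == b)
```

Grouping a crawl export by the normalised URL shows how many distinct URLs collapse into each real page.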
I worked on a client website a couple of years back who had this issue. We worked out that with all the filtering options they had, there were over a BILLION URLs that could be crawled by Google. This number was off the charts when you consider that there were only about 20,000 products offered.
Remember, Google does allocate crawl budget based on your PageRank. You need to ensure that this budget is being used in the most efficient way possible.
How to solve:
Before going further, I want to address another common, related problem: the URLs may not be SEO friendly because they are not database driven.  This isn't the issue I'm concerned about in this particular scenario as I'm more concerned about wasted crawl budget and having pages indexed which do not need to be, but it is still relevant.
The first place to start is addressing which pages you want to allow Google to crawl and index. This decision should be driven by your keyword research, and you need to cross reference all database attributes with your core target keywords. Let's continue with the theme from Go Outdoors for our example:
Here are our core keywords:
  • Waterproof jackets
  • Hiking boots
  • Women's walking trousers
On an eCommerce website, each of these products will have attributes associated with them which will be part of the database. Some common examples include:
  • Size (e.g. Large)
  • Colour (e.g. Black)
  • Price (e.g. £49.99)
  • Brand (e.g. North Face)
Your job is to find out which of these attributes are part of the keywords used to find the products. You also need to determine what combination (if any) of these attributes are used by your audience.
In doing so, you may find that there is a high search volume for keywords that include "North Face" + "waterproof jackets." This means that you will want a landing page for "North Face waterproof jackets" to be crawlable and indexable. You may also want to make sure that the database attribute has an SEO friendly URL, so rather than "waterproof-jackets/?brand=5" you will choose "waterproof-jackets/north-face/." You also want to make sure that these URLs are part of the navigation structure of your website, both to ensure a good flow of PageRank and so that users can find these pages easily.
On the other hand, you may find that there is not much search volume for keywords that combine "North Face" with "Black" (for example, "black North Face jackets"). This means that you probably do not want the page with these two attributes to be crawlable and indexable.
Once you have a clear picture of which attributes you want indexed and which you don't, it is time for the next step, which is dependent on whether the URLs are already indexed or not.
If the URLs are not already indexed, the simplest step to take is to add the URL structure to your robots.txt file. You may need to play around with some pattern matching to achieve this: robots.txt does not support full regular expressions, but Google honours the * wildcard and the $ end-of-URL anchor. Make sure you test your patterns properly so you don't block anything by accident. Also, be sure to use the Fetch as Google feature in Webmaster Tools. It's important to note that if the URLs are already indexed, adding them to your robots.txt file will NOT get them out of the index.
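Before a rule goes live, it is worth running it against a handful of real URLs. Here is a small sketch that imitates the wildcard matching Google documents for robots.txt (* matches any run of characters, a trailing $ anchors the end of the URL). It is an approximation for testing patterns, not Google's actual parser, and it ignores Allow/Disallow precedence; the function names are made up for this example.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """True if a 'Disallow: <pattern>' line would match this path and query."""
    return robots_pattern_to_regex(pattern).match(path) is not None


for path in ("/product-category?colour=12",
             "/product-category?size=5&colour=12",
             "/waterproof-jackets/north-face/",
             "/colourful-jackets/"):
    print(path, is_blocked("/*colour=", path), is_blocked("/*colour", path))
```

The last test path is the kind of accident to look out for: the looser pattern "/*colour" also catches "/colourful-jackets/", while "/*colour=" does not.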
If the URLs are indexed, I'm afraid you need to use a plaster to fix the problem: the rel=canonical tag. In many cases, you are not fortunate enough to work on a website when it is being developed. The result is that you may inherit a situation like the one above and not be able to fix the core problem. In cases such as this, the rel=canonical tag serves as a plaster put over the issue with the hope that you can fix it properly later. You'll want to add the rel=canonical tag to the URLs you do not want indexed and point to the most relevant URL which you do want indexed.
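As a rough illustration of how the canonical target could be worked out, the sketch below drops every parameter you decided not to index and builds the tag from what is left. The INDEXABLE_PARAMS set is a hypothetical outcome of the keyword research described above (only "brand" pages are wanted in the index), and the function name is invented for the example.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical decision from keyword research: only 'brand' pages get indexed.
INDEXABLE_PARAMS = {"brand"}


def canonical_link_tag(url):
    """Build a rel=canonical tag pointing at the URL minus non-indexable parameters."""
    parts = urlsplit(url)
    kept = sorted((key, value) for key, value in parse_qsl(parts.query)
                  if key in INDEXABLE_PARAMS)
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
    return '<link rel="canonical" href="%s" />' % escape(canonical, quote=True)


print(canonical_link_tag("http://www.example.com/waterproof-jackets?colour=12&brand=5"))
print(canonical_link_tag("http://www.example.com/waterproof-jackets?colour=12"))
```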

4. Soft 404 errors 

This happens more often than you'd expect. A user will not notice anything different, but search engine crawlers sure do.  
A soft 404 is a page that looks like a 404 but returns an HTTP status code 200. In this instance, the user sees some text along the lines of "Sorry the page you requested cannot be found." But behind the scenes, a code 200 is telling search engines that the page is working correctly. This disconnect can cause problems with pages being crawled and indexed when you do not want them to be.
A soft 404 also means you cannot spot real broken pages and identify areas of your website where users are receiving a bad experience. From a link building perspective (I had to mention it somewhere!), this is not a good situation either. You may have incoming links to broken URLs, but the links will be hard to track down and redirect to the correct page.
How to solve:
Fortunately, this is a relatively simple fix for a developer who can set the page to return a 404 status code instead of a 200. Whilst you're there, you can have some fun and make a cool 404 page for your users' enjoyment. Here are some examples of awesome 404 pages, and I have to point to Distilled's own page here :)
To find soft 404s, you can use the feature in Google Webmaster Tools which will tell you about the ones Google has detected:
You can also perform a manual check by going to a broken URL on your site (such as www.example.com/5435fdfdfd) and seeing what status code you get. A tool I really like for checking the status code is Web Sniffer, or you can use the Ayima tool if you use Google Chrome.
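If you would rather script the manual check, something along these lines works with only the Python standard library. It is a sketch, not a replacement for the tools above: the function names are my own, and it requests a random path that should not exist on the site.

```python
import urllib.error
import urllib.request
import uuid


def probe_url(site_root):
    """Build a URL that should not exist, e.g. http://www.example.com/<random hex>."""
    return site_root.rstrip("/") + "/" + uuid.uuid4().hex


def fetch_status(url):
    """Return the final HTTP status code for url (urlopen follows redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is what we want.
        return error.code


def is_soft_404(status_for_missing_page):
    """A page that should be missing but answers 200 is a soft 404."""
    return status_for_missing_page == 200


# Example (makes a real request, so it is left commented out):
# print(is_soft_404(fetch_status(probe_url("http://www.example.com"))))
```

Because urlopen follows redirects, a site that redirects every broken URL to the homepage will also show up as a 200 here, which is worth flagging for the same reason.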

5. 302 redirects instead of 301 redirects

Again, this is an easy redirect for developers to get wrong because, from a user's perspective, they can't tell the difference. However, the search engines treat these redirects very differently. Just to recap, a 301 redirect is permanent and the search engines will treat it as such; they'll pass link equity across to the new page. A 302 redirect is a temporary redirect and the search engines will not pass link equity because they expect the original page to come back at some point.
How to solve:
To find 302 redirected URLs, I recommend using a deep crawler such as Screaming Frog or the IIS SEO Toolkit. You can then filter by 302s and check to see if they should really be 302s, or if they should be 301s instead.
To fix the problem, you will need to ask your developers to change the rule so that a 301 redirect is used rather than a 302 redirect.
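For spot-checking a handful of URLs without a full crawl, here is a minimal sketch that asks for a URL without following the redirect and reports the status code and target. The function names are hypothetical, and it sends a HEAD request, which a few servers answer differently from GET.

```python
import http.client
from urllib.parse import urlsplit


def redirect_info(url):
    """Request url without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        connection = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        connection.request("HEAD", target)
        response = connection.getresponse()
        return response.status, response.getheader("Location")
    finally:
        connection.close()


def temporary_redirects(results):
    """Given {url: (status, location)}, return only the entries that are 302s."""
    return {url: location
            for url, (status, location) in results.items()
            if status == 302}


# Example (makes real requests, so it is left commented out):
# urls = ["http://www.example.com/old-page"]
# print(temporary_redirects({u: redirect_info(u) for u in urls}))
```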

6. Broken/Outdated sitemaps

Whilst not essential, XML sitemaps are very useful to the search engines to make sure they can find all URLs that you care about. They can give the search engines a nudge in the right direction. Unfortunately, some XML sitemaps are generated one-time-only and quickly become outdated, causing them to contain broken links and not contain new URLs.  
Ideally, your XML sitemaps should be updated regularly so that broken URLs are removed and new URLs are added. This is more important if you're a large website that adds new pages all the time. Bing has also said that they have a threshold for "dirt" in a sitemap and if the threshold is hit, they will not trust it as much.
How to solve:
First, you should do an audit of your current sitemap to find broken links. This great tool from Mike King can do the job.
Second, you should speak to your developers about making your XML sitemap dynamic so that it updates regularly. Depending on your resources, this could be once a day, once a week, or once a month. There will be some development time required here, but it will save you (and them) plenty of time in the long run.
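If you want to run the audit yourself, the core of it is short. The sketch below pulls every loc entry out of a standard XML sitemap and reports the ones that do not return a 200. The get_status argument is any function that takes a URL and returns a status code; the sample sitemap and the fake status table are invented for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def broken_sitemap_urls(xml_text, get_status):
    """Return (url, status) pairs for sitemap entries that do not return 200."""
    results = []
    for url in sitemap_urls(xml_text):
        status = get_status(url)
        if status != 200:
            results.append((url, status))
    return results


# Hypothetical sitemap and status codes, for illustration only.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/discontinued-boots</loc></url>
</urlset>"""
fake_statuses = {"http://www.example.com/": 200,
                 "http://www.example.com/discontinued-boots": 404}
print(broken_sitemap_urls(sample_sitemap, fake_statuses.get))
```

Redirected URLs (301s and 302s) are reported as well, since a clean sitemap should list final URLs only.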
An extra tip here: you can experiment and create sitemaps which only contain new products and have these particular sitemaps update more regularly than your standard sitemaps. You could also do a bit of extra lifting, if you have dev resources, to create a sitemap which only contains URLs which are not indexed.

A few uncommon technical problems

I want to include a few problems that are not common and can actually be tricky to spot. The issues I'll share have all been seen recently on my client projects.

7. Ordering your robots.txt file wrong

I came across an example of this very recently, which led to a number of pages being crawled and indexed which were blocked in robots.txt.
The URLs in this case were crawled because the combination of commands within the robots.txt file was wrong. Individually the commands were correct, but they didn't work together correctly.
Google explicitly say this in their guidelines but I have to be honest, I hadn't really come across this problem before so it was a bit of a surprise.
How to solve:
Use your robots commands carefully and if you have separate commands for Googlebot, make sure you also tell Googlebot what other commands to follow - even if they have already been mentioned in the catchall command. Make use of the testing feature in Google Webmaster Tools that allows you to test how Google will react to your robots.txt file.
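The behaviour is easy to reproduce offline. Python's standard urllib.robotparser happens to apply the same rule: a crawler with its own group follows only that group and ignores the catchall. The robots.txt contents here are a made-up example, and Google's own parser may differ in other details, so treat this as a sketch rather than a substitute for the Webmaster Tools tester.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /private/ is blocked only in the catchall group.
BROKEN = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
"""

# Fixed version: the Googlebot group repeats the catchall rule.
FIXED = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """True if the given user agent may fetch url under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.example.com/private/page"
print(allowed(BROKEN, "Googlebot", page))     # True: the Googlebot group has no /private/ rule
print(allowed(BROKEN, "SomeOtherBot", page))  # False: falls back to the catchall group
print(allowed(FIXED, "Googlebot", page))      # False: the rule is repeated for Googlebot
```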

8.  Invisible character in robots.txt

I recently did a technical audit for one of my clients and noticed a warning in Google Webmaster Tools stating that "Syntax was not understood" on one of the lines. When I viewed the file and tested it, everything looked fine. I showed the issue to Tom Anthony who fetched the file via the command line and he diagnosed the problem: an invisible character had somehow found its way into the file.
I managed to look rather silly at this point by re-opening the file and looking for it!
How to solve:
The fix is quite simple. Simply rewrite the robots.txt file and run it through the command line again to re-check. If you're unfamiliar with the command line, check out this post by Craig Bradford over at Distilled.
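If you want to find out exactly where the invisible character is, a few lines of Python will do it. This sketch flags anything that is not printable ASCII, on the assumption that a robots.txt file should contain nothing else (non-ASCII URLs are normally percent-encoded); the sample bytes are invented, with a byte order mark on the first line and a zero-width space on the second.

```python
def find_invisible_characters(raw_bytes):
    """Return (line_number, column, code_point) for each suspicious character."""
    text = raw_bytes.decode("utf-8", errors="replace")
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, char in enumerate(line.rstrip("\r"), start=1):
            # Printable ASCII (space to tilde) and tabs are fine.
            if char == "\t" or 32 <= ord(char) <= 126:
                continue
            findings.append((line_no, col, "U+%04X" % ord(char)))
    return findings


# Hypothetical file contents, for illustration only.
sample = b"\xef\xbb\xbfUser-agent: *\nDisallow: /tmp/\xe2\x80\x8b\n"
for finding in find_invisible_characters(sample):
    print(finding)
```

To check a live file, read it in binary mode (or fetch the raw bytes) so that nothing is silently stripped before the check runs.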

9.  Google crawling base64 URLs

This problem was a very interesting one we recently came across, and another one that Tom spotted. One of our clients saw a massive increase in the number of 404 errors being reported in Webmaster Tools. We went in to take a look and found that nearly all of the errors were being generated by URLs in this format:
/aWYgeW91IGhhdmUgZGVjb2RlZA0KdGhpcyB5b3Ugc2hvdWxkIGRlZmluaXRlbHkNCmdldCBhIGxpZmU=/
Webmaster Tools will tell you where these 404s are linked from, so we went to the page to find out how this URL was being generated. As hard as we tried, we couldn't find it. After lots of digging, we were able to see that these were authentication tokens generated by Ruby on Rails to try and prevent cross site requests. There were a few in the code of the page, and Google were trying to crawl them!
In addition to the main problem, the authentication tokens are all generated on the fly and are unique, hence why we couldn't find the ones that Google were telling us about.
How to solve:
In this case, we were quite lucky because we were able to add a pattern-matching rule to the robots.txt file which told Google to stop crawling these URLs. It took a bit of time for Webmaster Tools to settle down, but eventually everything was calm.
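To gauge how many of the reported 404s fall into this bucket, you can run the exported error URLs through a quick heuristic. The sketch below treats a path segment as base64 if it is at least 16 characters long, uses only the base64 alphabet, and decodes cleanly. The threshold and the function names are my own assumptions, and an ordinary all-letters slug of the right length can slip through, so eyeball the results.

```python
import base64
import binascii
import re

BASE64_SEGMENT = re.compile(r"^[A-Za-z0-9+/]{16,}={0,2}$")


def looks_like_base64(segment):
    """Heuristic: right alphabet, length a multiple of 4, and it decodes."""
    if len(segment) % 4 != 0 or not BASE64_SEGMENT.match(segment):
        return False
    try:
        base64.b64decode(segment, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def base64_segments(path):
    """Return the path segments that look like base64 tokens.

    Tokens that themselves contain '/' are split apart here and may be missed.
    """
    return [seg for seg in path.split("/") if seg and looks_like_base64(seg)]


token = "aWYgeW91IGhhdmUgZGVjb2RlZA0KdGhpcyB5b3Ugc2hvdWxkIGRlZmluaXRlbHkNCmdldCBhIGxpZmU="
print(base64_segments("/" + token + "/"))
print(base64_segments("/product-category/north-face/"))
```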

10. Misconfigured servers

This section was actually written by Tom, who worked on this particular client project. We encountered a problem with a website's main landing/login page not ranking. The page had been ranking and at some point had dropped out, and the client was at a loss. The pages all looked fine, loaded fine, and didn't seem to be doing any cloaking as far as we could see.

After lots of investigation and digging, it turned out that there was a subtle problem caused by a mis-configuration of the server software, with the HTTP headers from their server.

Normally an 'Accept' header would be sent by a client (your browser) to state which file types it understands, and very rarely this would modify what the server does. The server when it sends a file always sends a "Content-Type" header to specify if the file is HTML/PDF/JPEG/something else.

Their server (they're using Nginx) was returning a "Content-Type" that was a mirror of the first file type found in the client's "Accept" header. If you sent an Accept header that started "text/html," then that is what the server would send back as the Content-Type header. This is peculiar behaviour, but it wasn't being noticed because browsers almost always send "text/html" as the start of their Accept header.

However, Googlebot sends "Accept: */*" when it is crawling (meaning it accepts anything).
(See: http://webcache.googleusercontent.com/search?sourceid=chrome&ie=UTF-8&q=cache:http://www.ericgiguere.com/tools/http-header-viewer.html)
I found that sending a */* Accept header caused the server to fall over: */* is not a valid Content-Type, so the server would crumble and send an error response.

Changing your browser's user agent to Googlebot does not influence the other HTTP headers, and tools such as web-sniffer also don't send the same HTTP headers as Googlebot, so you would never notice this issue with them!
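A check like this is straightforward to script. The sketch below sends a request with the "Accept: */*" header described above plus a Googlebot user agent string, and reports the status code and Content-Type that come back. It only approximates the two headers discussed here (urllib adds a few of its own, and real Googlebot sends others), and the function names are invented for the example.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def googlebot_like_request(url):
    """Build a request carrying the Accept and User-Agent headers discussed above."""
    return urllib.request.Request(
        url, headers={"Accept": "*/*", "User-Agent": GOOGLEBOT_UA}
    )


def status_and_content_type(request):
    """Send the request and return (status code, Content-Type header)."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        # Error responses still carry headers worth inspecting.
        return error.code, error.headers.get("Content-Type")


# Example (makes a real request, so it is left commented out):
# print(status_and_content_type(googlebot_like_request("http://www.example.com/")))
```

Comparing the result with the same request sent using a browser-style Accept header (one starting with text/html) would have exposed the difference straight away.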

Within a few days of fixing the issue, the pages were re-indexed and the client saw a spike in revenue.