Duplicate Content Issues

Amanda Watlington
She says it’s an exciting topic because she’s seeing a lot of it. (She’s talking very loudly and enunciating powerfully.)
Typical causes of duplicate content

  • Multiple domains
  • Re-designs
  • CMS
  • Subdomains
  • Landing pages
  • Syndicated, scraped, or shared content

Tools for detection: your ability to search is your best tool. Use a unique ten-word snippet as the query; she lists a few tools.
Multiple domains
These can occur when ownership of domains changes, when domains are purchased, when there is an IT/marketing disconnect, or when there are lots of changes in personnel.
She’s using the example of http://www.monkbiz.com vs. http://www.monkeybuisness.com; the second domain was purchased later.
Use the unique snippet and check the content, then check for redirections (200 vs. 301) and change them to 301s. Remember: when there is one, there may be more.
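Her 200-vs-301 check can be scripted: request each suspect domain without following redirects, look at the status code, and flag anything that isn’t a permanent redirect. A minimal sketch (the classification wording is mine, not hers):

```python
# Sketch: classify HTTP status codes for a duplicate-domain audit.
# A purchased extra domain should answer with a 301, not serve content.
def classify_status(status: int) -> str:
    if status == 301:
        return "ok: permanent redirect in place"
    if status == 200:
        return "fix: duplicate content served directly; change to a 301"
    if status in (302, 303, 307):
        return "fix: temporary redirect; change to a 301"
    return "investigate"

# A real audit would issue a HEAD request per domain with redirect
# following disabled (e.g. via http.client) and feed the status here.
print(classify_status(200))  # fix: duplicate content served directly; change to a 301
```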

site redesign
This happens when a site is redesigned and the URLs are changed without an SEO plan in place. No type of site is immune; platform changes (HTML to PHP) are a typical trigger, and a site with good search traction will simply have a bigger footprint to fix.
She’s using ferretfashioon.com as an example.
Check for more pages in the engines than the site actually has.
Depending on how the site is built, the fix can be done via a series of rules, 404 trapping, and 301 redirects.
Make sure any actions give the results you want; forward planning beats follow-up.
Make sure you have a custom 404 page.
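On many platforms the redirect layer amounts to a rule table consulted before the request falls through to the custom 404. A minimal sketch (the old and new paths are hypothetical):

```python
# Sketch: map retired pre-redesign URLs to their new homes; anything
# unmapped falls through to a custom 404 page, not a bare server error.
OLD_TO_NEW = {
    "/products.html": "/products/",   # hypothetical pre-redesign path
    "/about_us.php": "/about/",
}

def route(path: str) -> tuple[int, str]:
    if path in OLD_TO_NEW:
        return (301, OLD_TO_NEW[path])    # permanent redirect
    return (404, "/custom-404.html")      # 404 trapping with a custom page

print(route("/products.html"))  # (301, '/products/')
print(route("/gone.html"))      # (404, '/custom-404.html')
```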

CMS
Usually the duplication is driven directly by the platform; sometimes it’s the result of how products are merchandised. Look for the problem at the product-level page. She’s using a lighting store as an example.
This is a very disjointed, badly organized session, and the loudness of her speaking is harsh.
Check product-level pages.
Typically this problem requires a complete URL re-architecture. The goal is to have a single URL associated with a product page, no matter which category it is linked from. This is not easy, and the fix depends on the site’s system.
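One common shape for such a re-architecture (a sketch of the general idea, not her specific fix) is to derive the product URL from the product ID alone, so every category path collapses to one address:

```python
# Sketch: one canonical URL per product, regardless of which category
# the visitor browsed through. Paths and IDs are hypothetical.
def canonical_product_url(product_id: str) -> str:
    return f"/product/{product_id}/"

# Both navigation paths below collapse to the same indexable URL:
for path in ("/lighting/chandeliers/p1234", "/sale/p1234"):
    pid = path.rsplit("/", 1)[-1]
    print(canonical_product_url(pid))  # /product/p1234/ both times
```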

subdomains
Multinationals and corporate sites often have this problem.
An example is vermin.com, which also has information on rats and mice, so they have made rats.vermin.com and mice.vermin.com.
It’s not hard to detect if you know the reasons behind it.
To repair this, treat them as separate sites; a content strategy should be used to make sure that the same content does not appear more than once.

landing pages
This happens when a webmaster is testing multiple creatives and landing pages.
Look at the search results. This is not a repair issue; it’s a prevention issue.
Use 301s if this happens, but use robots.txt before it happens: prevent it before it happens.
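The prevention side can be verified with the standard library’s robots.txt parser; the `/lp/` landing-page test directory here is an assumption:

```python
from urllib.robotparser import RobotFileParser

# Sketch: keep a (hypothetical) landing-page test directory out of the
# index before duplicate creatives ever get crawled.
rules = """\
User-agent: *
Disallow: /lp/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/lp/creative-b.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))          # True
```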

syndication and scraping
Large sites are targets; some of this duplication is the result of contractual agreements.
There are legal issues.
For example, vermin.com makes their content available to rodents.com; this happens often with manufacturers.
Normally you can just ask where the content comes from.
To fix it you must try to add value and make the content unique.
Discuss scrapers with your legal department.

Bill Slawski SEObyTheSea
The main problem with duplicate content is that SEs don’t want to show the same content over and over, but which version do they show? And are there crawling issues related to having lots of copies of the same page?
He describes a large commercial site with 3500 pages for which Google showed 95k pages, and he started seeing some patterns: one page showed up in Google 15k times with different URLs. It was a Lotus site with little widgets that expanded pieces of data; some pages had 21 of the widgets, and each click produced a new URL which was getting indexed. Internal link popularity was suffering and not all the pages on the site were being indexed. The widget was changed to JavaScript that didn’t create new URLs; the rankings went up and the number of pages in Google went down.
Sites that practice e-commerce but take product descriptions straight from the manufacturer: when lots of other people do the same thing, you’re hurting yourself. You should get unique content.
Alternate print pages: you can use alternate stylesheets instead of alternate pages, or use robots.txt or noindex meta tags on the print pages.
Syndicated feeds: most people like to use full feeds, but Bloglines sometimes ranks better than your blog. How do you become the authority?
Canonical domain name issues: 301 the non-www URLs. It is a flaw with MSN’s algorithm that redirected pages may be classified as duplicates.
Session IDs and multiple data variables: keep stuff simple. This is risk mitigation; keep URLs as simple as possible and don’t let the SEs decide.
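“Keep it as simple as possible” usually means stripping session and tracking parameters so one piece of content has one URL. A sketch using the standard library (the parameter names are assumptions):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that spawn duplicate URLs for the same content
# (hypothetical names -- adjust to your own platform).
DROP_PARAMS = {"sessionid", "sid", "ref"}

def simplify(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in DROP_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(simplify("http://example.com/p?id=7&sessionid=abc123"))
# http://example.com/p?id=7
```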
Pages with content that is too similar: title, meta, and navigation are all the same. Don’t just change little bits of pages.
Copyright infringement: some pages get copied so much it’s not possible to run to the legal department every time; the key is to make your page the authority.
Subdomains: a site sells subdomains as a premium service, and each one had a lot of the same pages.
Article syndication: is it worth doing? Syndicated articles bring in pages; if the pages are different enough they might get indexed.
Mirrored sites may get ignored at crawl time; which mirrored site shows up?

Don’t Crawl the DUST: Different URLs, Same Content
There is a white paper and a Google tech talk. It looks at some of the basic ways an SE can handle duplicate issues, like trailing slashes and www vs. non-www URLs.
Shingles are hash-based ways of comparing pages.
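A shingle is a short run of consecutive words, hashed so pages can be compared cheaply. A minimal sketch of shingling plus Jaccard overlap (the window size of 4 is an arbitrary choice):

```python
# Sketch: 4-word shingles with Jaccard similarity -- the hash-based
# page comparison Bill alludes to, in its simplest possible form.
def shingles(text: str, k: int = 4) -> set[int]:
    words = text.lower().split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

s1 = shingles("the quick brown fox jumps over the lazy dog")
s2 = shingles("the quick brown fox leaps over the lazy dog")
print(round(jaccard(s1, s2), 2))  # 0.2 -- one changed word breaks 4 shingles
```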
DustBuster looks at rules for how a site’s URLs are formed and where there may be dupes, and decides what should be ignored and what shouldn’t. Examples: google.com/news vs. news.google.com, and http://www.nytimes.com vs. nytimes.com; one is a PR10, one is a PR7.
The DUST paper does not detail which pages are kept and which are discarded.
Try to avoid duplicate content as much as possible. You can’t do much to control people outside of your site, but you can control your own site.

Collapsing Equivalent Results
An MSN paper; it tries to go into which page should be shown out of a set of dupes. In the example there are .com mirrors.
How should the SE handle these?
Results storage: it keeps all the results and uses a query-independent rank factor, like PageRank or page quality, plus navigational context.
It selects .com because “users prefer .com”; users also prefer the shorter version of the URL, prettier URLs. It might also consider the navigational version: whatever has fewer redirects and less latency. You could be clicking on ymca.com but actually being sent to ymca.com/index.jsp. You should look at white papers and the like for insight into what engineers think may be important, but don’t take all of this as gospel. Keywords in the URL may also indicate which URL will be shown.
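As a toy illustration only (the scoring weights below are invented, not from the paper), a selector that prefers .com, shorter URLs, and fewer redirect hops might look like:

```python
# Toy sketch: rank duplicate URLs by the user-preference signals the
# paper hints at -- .com preferred, shorter is prettier, fewer
# redirects means less latency. All weights are invented.
def preference_score(url: str, redirect_hops: int) -> float:
    score = 0.0
    if ".com/" in url or url.rstrip("/").endswith(".com"):
        score += 2.0                  # "users prefer .com"
    score -= 0.1 * len(url)           # shorter / prettier URLs win
    score -= 0.5 * redirect_hops      # fewer redirects, less latency
    return score

# ymca.com redirects once (to /index.jsp); the bare URL is still prettier.
dupes = [("http://ymca.com", 1), ("http://ymca.com/index.jsp", 0)]
best = max(dupes, key=lambda d: preference_score(*d))
print(best[0])  # http://ymca.com
```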
Some sites have lots of different TLDs, and the local TLD may be the one an SE shows you. Which page is more popular (link popularity)? Which page gets more clickthroughs?
There are lots of potential ways content can be duplicated, and it can negatively influence your site.
You should plan carefully; SEs are likely working on solutions, and some decisions on which page to display may surprise you.

Tim Converse Yahoo
He says all the suggestions so far are right on.
A lot of what he does is anti-black-hat measures, and he says that might make him sound adversarial.
He uses “dupes” as an abbreviation.
There are two bad user experiences: showing the same content 10 times, and showing only one result.
They use crawl-time filtering, index-time filtering, and query-time filtering. They only show two results from one site.
They often want to retain some duplicate results instead of just not crawling them: they use local preference for searchers, there may be slight variations, and redundancy has value.
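The “two results from one site” behaviour amounts to collapsing by host at query time. A sketch with an invented result list:

```python
# Sketch: query-time host collapsing -- keep at most two results per
# host, as Tim describes. The result list is hypothetical.
def collapse_by_host(results: list[tuple[str, str]], limit: int = 2) -> list[str]:
    counts: dict[str, int] = {}
    kept = []
    for host, url in results:
        if counts.get(host, 0) < limit:
            kept.append(url)
            counts[host] = counts.get(host, 0) + 1
    return kept

results = [
    ("vermin.com", "http://vermin.com/rats"),
    ("vermin.com", "http://vermin.com/mice"),
    ("vermin.com", "http://vermin.com/voles"),
    ("rodents.com", "http://rodents.com/"),
]
print(collapse_by_host(results))
# ['http://vermin.com/rats', 'http://vermin.com/mice', 'http://rodents.com/']
```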
Alternate document forms, legitimate syndication, multiple regional markets, partial-dupe pages built from boilerplate, and accidental duplication are all legit reasons why people may duplicate content.
Accidental duplication may be due to session IDs and soft 404s.
Now he’s switching to black-hat issues.
Dodgy duplication issues:

  • unnecessary duplication across multiple domains
  • aggregation, which can range from useful to abusive
  • identical content with minimal added value
  • repeated statements

Scraper spammers: they attempt to combat this stuff and do their best to surface the original content, and they won’t say much about how. This problem is becoming worse.

Brian White Google
He’s going over the different types of dupes.
Duplicate detection happens at different places in the pipeline; different types of filters are used for different types of duplicated content. The goal is to serve one version of the content.
He says to pick a specific URL and block the non-preferred version, e.g. with robots.txt; use 301s and block printable versions.
Do you call them fragments or named anchors? Most people like to call them named anchors. These may cause issues: the whole page will be used, and anything after the # will be ignored.
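The fragment behaviour is easy to demonstrate with the standard library: everything after the `#` points into a page, it does not create a new page:

```python
from urllib.parse import urldefrag

# Sketch: a fragment / named anchor does not make a new URL to index;
# the page is fetched once and the part after '#' stays client-side.
url, fragment = urldefrag("http://example.com/guide.html#chapter-2")
print(url)       # http://example.com/guide.html
print(fragment)  # chapter-2
```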
Boilerplate can cause issues: lots of boilerplate means lots of similarity.
The same content in different languages is not duplicate.
The same content on local TLDs is not duplicate.
Multiple domains with the same content should be 301’d.
Syndication: if you syndicate your stuff, include an absolute link to the original; if you use syndicated articles, just be aware that they may not rank.
There are several ways to handle scrapers: SEs are working on it, there’s google.com/dmca.html, and watch out for blog bots spoofing Googlebot.
Make your pages unique and valuable.

Google just said: NoFollow should really be called “untrusted”. They will follow nofollow links for discovery, but a link with the nofollow attribute won’t affect rankings.
