Site Structure for Crawlability

Tim Converse, Yahoo
He doesn't work for the crawler group, but he knows what they find annoying. Crawlers are simple-minded: visit a URL, store the contents, extract all the links, decide when to crawl those links, and decide when to refresh the page. Their crawlers run continuously. External links and domain registration are how they find new domains. Crawlers are a few years behind modern browsers on javascript, flash, and CSS, so your links should be in vanilla HTML; don't rely on fancy other things being or not being processed.
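
That visit/store/extract loop is simple enough to sketch. A minimal illustration in Python, where fetch() and store() are hypothetical callables standing in for the actual I/O (real crawlers add politeness delays, robots.txt checks, and refresh scheduling):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags in vanilla HTML."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(root_url, fetch, store):
        frontier = deque([root_url])        # URLs waiting to be visited
        seen = {root_url}
        while frontier:                     # a real crawler runs continuously
            url = frontier.popleft()
            html = fetch(url)               # visit URL
            store(url, html)                # store contents
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:    # extract all links
                absolute = urljoin(url, href)
                if absolute not in seen:    # decide whether to crawl it
                    seen.add(absolute)
                    frontier.append(absolute)
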
The perfect site for crawlers:
All pages reachable from the root page, in a tree structure. Link to a sitemap.
Links should be extractable from plain HTML; check with View Source or an old-school web browser.
Every distinct URL should match up with distinct content; multiple URLs to the same content look like dupes (see the sketch after this list).
Limit dynamic parameters.
Sessions and cookies should not determine content.
They won't crawl anything blocked in robots.txt.
They do pay attention to internal anchor text.
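
The dupe point is easy to picture: a crawler that hashes page bodies sees several URLs for the same content collapse into one bucket. A small sketch, with made-up URLs:

    import hashlib
    from collections import defaultdict

    # Made-up URLs: three names for the same bytes.
    pages = {
        "http://example.com/widgets":              "<html>widget catalog</html>",
        "http://example.com/widgets?sessid=123":   "<html>widget catalog</html>",
        "http://example.com/widgets?sort=default": "<html>widget catalog</html>",
    }

    by_content = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.md5(body.encode()).hexdigest()
        by_content[digest].append(url)      # identical bodies share a digest

    for urls in by_content.values():
        if len(urls) > 1:
            print("looks like duplicate content:", urls)
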
Dynamic sites
Constructed on the fly; the term has an almost obsolete meaning now. The URL has arguments after the question mark. SEs look at the URL: if there are lots of parameters, the content is likely to be duplicated.
Stay away from session-ID URLs.
Map non-? URLs to dynamic content,
or provide a session-ID-free way to navigate the site.
Soft-404 traps are bad: if the URL is bogus, send a real 404.
The worst case is a status 200 on a bad URL containing links that don't exist; a soft-404 trap makes it hard to get at the real content.
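
One quick way to test for a soft-404 trap is to request a URL that cannot exist and check the status code. A sketch using the standard library; example.com stands in for your own site:

    import urllib.error
    import urllib.request

    def check_soft_404(site):
        # A path that should never exist; a healthy server 404s it.
        bogus = site.rstrip("/") + "/no-such-page-zq1x9"
        try:
            # Note: urlopen follows redirects, so a redirect-to-homepage
            # also shows up here as a 200, which is the same trap.
            response = urllib.request.urlopen(bogus)
            print("soft-404 trap: status", response.status, "for", bogus)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                print("OK: bogus URL returns a real 404")
            else:
                print("status", err.code, "for", bogus)

    check_soft_404("http://example.com")
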

robots.txt
Evil bots may not obey it, but all major SEs will.
Don't accidentally screen good bots out; often a robots.txt will screen out every bot but a single one.
Yahoo added extensions to robots.txt: wildcard (regexp-style) syntax. Google also supports the same syntax; use it to screen out dupes.
When the rules disagree, the longest pattern wins.
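
The longest-pattern rule is simple enough to sketch. This is just an illustration of the behavior as described, not either engine's actual matcher, and the example rules are made up:

    import re

    # Made-up rules: allow product pages, screen out session-ID dupes.
    rules = [
        ("Allow", "/products"),
        ("Disallow", "/products?*sessionid="),
    ]

    def pattern_to_regex(pattern):
        # '*' matches any run of characters; '$' anchors the end.
        regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
        return re.compile("^" + regex)

    def allowed(path):
        matches = [(len(p), verb) for verb, p in rules
                   if pattern_to_regex(p).match(path)]
        if not matches:
            return True                             # no rule applies
        _, verb = max(matches, key=lambda m: m[0])  # longest pattern wins
        return verb == "Allow"

    print(allowed("/products"))                        # True
    print(allowed("/products?color=red&sessionid=1"))  # False
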

moving sites
301 to the new site and hang on to the old domain as long as possible; as much as they can, they will migrate ranking to the new site (this is being reworked). Map old paths to the new paths rather than just redirecting everything to the root (see the sketch below).
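
A path-to-path move can be as simple as a lookup table in front of the site. A minimal WSGI sketch; the paths and domain are invented for illustration, and a real move needs a complete map:

    # Made-up mapping; a real move needs a complete old-to-new map.
    OLD_TO_NEW = {
        "/products.php?id=7": "/products/blue-widget",
        "/about.html": "/about",
    }

    NEW_SITE = "http://www.example.com"   # placeholder domain

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if environ.get("QUERY_STRING"):
            path += "?" + environ["QUERY_STRING"]
        if path in OLD_TO_NEW:
            location = NEW_SITE + OLD_TO_NEW[path]
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]             # a real 404, not a soft-404
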

site explorer
Shows links, when the last crawl was, and what the representation in the index is.
Authenticate your site at Site Explorer and specify a feed of URLs to crawl.
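
The feed itself is, as I recall, just a plain text file of URLs, one per line (commonly named urllist.txt). A trivial sketch with made-up URLs:

    # Made-up URLs; list every canonical page you want crawled.
    urls = [
        "http://www.example.com/",
        "http://www.example.com/products/blue-widget",
        "http://www.example.com/about",
    ]

    with open("urllist.txt", "w") as feed:
        feed.write("\n".join(urls) + "\n")
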

He’s now showing a list of resources and help pages.

Brett says there will be a big announcement from the SEs tomorrow at the super session about this stuff!

Vanessa Fox, Google
Whatever works out well for visitors will work out well for the SEs: how well can a visitor navigate your site, are all the pages crawlable via links, and how accessible is your site, i.e. with extras turned off or in a mobile browser?
Take an objective look using a text browser, have someone look at your site and see how easily they can find things.
Use technology wisely: have alternate ways to access stuff. Don’t use splash pages, minimize flash, use alt tags, don’t put text in images, don’t use frames, minimize javascript. Use transcripts for videos.
Get the most out of your links: use anchor text in links, minimize redirects, and make sure every page on your site is accessible by static text links. Make sure you have an HTML site map and link to it from your main page.
Webmaster Central: crawl errors, bot activity, robots.txt problems, use a sitemap file.
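
A sitemap file is just the sitemaps.org XML format, and generating one takes a few lines. A sketch with placeholder URLs:

    from xml.sax.saxutils import escape

    # Placeholder URLs; list your real canonical pages.
    urls = [
        "http://www.example.com/",
        "http://www.example.com/products/blue-widget",
    ]

    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in urls:
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")

    with open("sitemap.xml", "w") as out:
        out.write("\n".join(lines) + "\n")
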
Check your site in the SERPs: she's using an example where a site's site: command shows redirect pages, bad title tags, and incorrectly optimized flash pages.
If only the URL shows in the SERPs, that means you may be blocking access to the crawler.
If all titles and descriptions are the same, that is bad; have a unique description and title tag on every page.
If your description is loading… loading… loading, you've got problems.
She’s showing webmaster central.

Mark Jackson
Don't haphazardly jump into a redesign. He's showing a picture of Michael Jackson young and old; it says "Be Careful".
Think about SEO early in the process, beginning to end. Make sure you have the content. Use keyword research for information architecture. Balance cool design with strategy; no all-flash sites. Assign keywords to each page and each URL, 3 phrases per page. Validate the site and make it Section 508 compliant. Copy should complement keywords. Use 301s.
Information architecture
Authority sites are very deep. Write good content for each product; put keywords in descriptions. Keep old content, like archived newsletters. Study analytics; current pages could be ranking.
Size matters: Wikipedia is a very, very deep site, and it's an authority site.
SE-friendly doesn't have to be ugly (he shows an ugly site).
He's using an example of a site with a flash popup on the homepage. You should have keywords on your homepage and have navigation.
Avoid image, flash, or javascript navigation; it is better to use CSS/text navigation. Use keywords in anchor text, and name pages using keywords.
Validate all pages on the site, especially the homepage and sitemap. Limit the use of javascript; put it in an includes file and use CSS. Watch page size and load times. Host on Apache or IIS and use URL rewriting.
He's using an example of Cookies by Design, where they cut the page size from 120k to 26k and put keywords in the left-hand navigation.
Don't use spacer.gif. The site ranks pretty well (#19 on gift baskets?).
250 words of content for interior pages, 400 on the homepage. No keyword stuffing; make it readable. Link to relevant pages; internal linking is very important. Don't use "Click Here".
He's using an example (conferencecall.com): they use "web conference" in internal links and they rank at 25 now.
Watch URL structure and the canonical domain: pick www or non-www and stick to it (see the sketch below).
Use URL rewriting.
Use 301s.
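
Enforcing the www choice is a one-redirect job. A minimal WSGI sketch (example.com is a placeholder; most sites would do this with Apache or IIS rewrite rules instead, per the hosting advice above):

    CANONICAL_HOST = "www.example.com"    # placeholder: whichever form you picked

    def canonical_host(app):
        def middleware(environ, start_response):
            host = environ.get("HTTP_HOST", "")
            if host and host != CANONICAL_HOST:
                url = "http://" + CANONICAL_HOST + environ.get("PATH_INFO", "/")
                if environ.get("QUERY_STRING"):
                    url += "?" + environ["QUERY_STRING"]
                start_response("301 Moved Permanently", [("Location", url)])
                return [b""]
            return app(environ, start_response)
        return middleware
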
Hyphens or underscores? Very unimportant; use either.
Unique title, description, and keywords; put the most important words first.
Don't stuff the keywords tag, but use a call to action in your description tag.
Now he's using an example, palmharbor.com: they put modular homes/manufactured homes/mobile homes in the title tag and they rank well now. Wikipedia beats them in some places (size matters).
Use good IA, use keywords, keep URLs static, and use 301s.

Brett Tabke
He's talking about WebmasterWorld. They have about 2.5 million pages. The content can be rolled up a lot of different ways. They also have a mobile version of the entire site, and a printer-friendly version: something like 20 million pages the SEs could possibly see. About two years ago the rankings totally disappeared because a bot got into the printer-friendly version and caused duplicate content problems.
They just reworked the entire site's URL structure, the most challenging programming he's ever done. They didn't want to redirect the old URLs; they just used the new URLs for new content. The new keyword URLs worked better for SEs; they're turning up for more keywords. Users' bookmarks were a big cause of confusion. They're using a dozen different variations. The SEs haven't hassled them; they're indexed better than ever before.
The site is all custom software Brett wrote. He set up a big network of sites for John Deere, and they got a lot of questions about how to structure a site. Try to find the sweet spot between SEs and users: the SEs are digital and the users are analog. Just before Infoseek was sold they were working on a theme engine, where a site would be indexed as a whole for 20-50 keywords. He brings up the theme pyramid table and the WebmasterWorld table. The structure is simple and based on keywords. They call it the "longer tail".

Questions:
Someone suggests making your site so that a blind person can use it.

Someone is asking about hidden divs and saying Google traffic has tailed off.
Vanessa says the divs should be viewable when JavaScript is turned off: show the content by default and hide the divs via JavaScript.

If you use a Google sitemap, the sitemap only augments the natural links; it's in addition to the "free crawl", says Yahoo.
Google seems to be case sensitive; Vanessa suggests redirecting to the chosen uppercase or lowercase form, and Yahoo agrees.

Someone is asking about page load time.
Yahoo says page load time and page size don't matter for the crawler, but slow, heavy pages are not good for users. Yahoo doesn't penalize pages that take a long time to load or that are large. Google agrees: there is a timeout, and they'll only use up a certain amount of bandwidth on a site. Google won't penalize for long load times or large size; Matt did a post about it. It's about sales, not traffic.

Someone is asking about multiple URLs pointing to the same site.
There will be a session about duplicate content after lunch.
