Pawel Szulencki Search Engine Optimization/Marketing blog.
There are different types of search engines, and they use different methods to search Internet resources.
Robots read only the links located on websites they have already found, and from those links they build a tree-like link hierarchy. Spiders read the whole content of a website: the title, the links, the document text and the information inside meta tags. There are also metacrawlers, which include Meta Search and Smarter Meta Search engines. A Meta Search engine sends the search term entered by the user to the indexes of several search engines simultaneously and returns information based on the results it gets back. Such engines do not have their own index servers and do not actively search the internet; they rely on the existing indexes of other search engines. Smarter Meta Search engines use the same method with one difference: they apply linguistic and collective analysis to produce more accurate search results than ordinary engines.
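To make the Meta Search idea more concrete, here is a minimal Python sketch. It is only an illustration: the two engine functions are hypothetical stand-ins for real search engine indexes, the merge rule (summing reciprocal ranks) is my own assumption rather than how any particular metacrawler works, and the engines are queried one after another instead of simultaneously to keep the code short.

```python
from collections import defaultdict

def meta_search(query, engines):
    """Send `query` to every engine and merge the ranked URL lists.

    `engines` maps an engine name to a callable that returns a ranked
    list of URLs. The merge rule (sum of reciprocal ranks) is an
    assumption made for illustration only.
    """
    scores = defaultdict(float)
    for search in engines.values():
        for rank, url in enumerate(search(query), start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical stand-ins for the indexes of two existing search engines.
def engine_a(query):
    return ["a.example/seo", "b.example/seo", "c.example/seo"]

def engine_b(query):
    return ["b.example/seo", "a.example/seo", "d.example/seo"]

merged = meta_search("seo", {"A": engine_a, "B": engine_b})
```

Note that the metacrawler keeps no index of its own here: everything it returns comes from the lists the other engines hand back.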
Search engines do not index everything that is available on the internet, though.
All of the websites that exist but are excluded from search engine indexes for one reason or another make up the so-called “invisible network” or “invisible internet”. This invisible internet is three times bigger than all the sites that make up the visible internet. This is because companies and government institutions understandably do not want to share their data, so they exclude their content from indexes or hide it inside an intranet.
Google calls its spider “Googlebot”. It is divided into Freshbot and Deepbot. The task of Freshbot is to find new, fresh content on a website, which is why it visits the same sites even a couple of times a day. Deepbot, on the other hand, is responsible for deep website crawling. Its main purpose is to build a full picture of your website: its content, navigation system and all the inbound and outbound links. If you see the search engine results change, it means Deepbot was active. In Google the search process is divided into three parts.
First, Googlebot crawls the internet looking for changes to existing websites and for new websites. It works like a standard web browser: it sends a request to a server for a specific page, then saves the page and sends it to the index servers. It can request thousands of different pages simultaneously, but it deliberately slows itself down to avoid crashing web servers or crowding out requests from real human visitors to the same server.
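The “deliberately slower” part can be pictured as a per-server waiting rule. The Python sketch below is my own simplified illustration, not Google’s actual mechanism: the class name, the fixed delay and the use of plain numbers for time are all assumptions.

```python
from urllib.parse import urlparse

class PolitenessScheduler:
    """Decide when a URL may be fetched so that no single server is
    hit more often than once every `min_delay_seconds` (an assumed,
    illustrative rule)."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def schedule(self, url, now):
        host = urlparse(url).netloc
        # Wait until the host is free again, or start right away.
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start

scheduler = PolitenessScheduler(min_delay_seconds=2.0)
first = scheduler.schedule("http://a.example/1", now=0.0)
second = scheduler.schedule("http://a.example/2", now=0.0)
other_host = scheduler.schedule("http://b.example/1", now=0.0)
```

Requests to different servers can still go out at the same moment; only repeated requests to the same server are spaced out.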
Googlebot finds sites in two ways: through websites submitted via the Add URL form (www.google.com/addurl.html) or by following links while crawling the internet.
When Google fetches a page, it collects all the links on that page and adds them to a “visit soon” URL list. That way it can cover a wide area of the internet in a short time and make the search process faster. But this also causes problems. Google has to examine the “visit soon” URL list for duplicate addresses and delete any duplicates so that it does not visit the same site too often.
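Here is a small Python sketch of such a “visit soon” list with duplicate checking. The class and method names are hypothetical, and treating two URLs as duplicates when they differ only in the `#fragment` part is my simplifying assumption; a real crawler would normalize URLs much more carefully.

```python
from collections import deque
from urllib.parse import urldefrag

class VisitSoonList:
    """A queue of URLs to crawl that silently drops duplicates."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        # Strip the "#fragment" part: it points inside the same page.
        url, _fragment = urldefrag(url)
        if url in self._seen:
            return False  # duplicate, not queued again
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self):
        return self._queue.popleft() if self._queue else None

frontier = VisitSoonList()
added = [
    frontier.add("http://a.example/"),
    frontier.add("http://a.example/#top"),
    frontier.add("http://a.example/page"),
]
```

Keeping a separate set of already seen addresses means the duplicate check stays fast even when the list itself grows very long.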
To keep its indexes up to date, Google re-crawls sites on a regular basis. For a newspaper site or a highly visited portal this can be daily, for stock quotes much more frequently, and for other pages once a month or several times a month.
Googlebot then sends the whole text it finds on sites to the Google indexer, a set of indexing database servers. They store the text sorted alphabetically by term, where a term is a keyword or phrase. Attached to each term is a list of the documents in which it appears and where in each document it occurs, so it is easy to locate the right document for a given user query. To eliminate unimportant words and make the search process faster, the Google indexer ignores the most frequent words in each language. These words, called “stop words” (such as is, on, or, the, at, in, how, why), don’t make the search any more precise, so they can be ignored.
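The structure described above is usually called an inverted index. A minimal Python sketch, assuming a tiny hand-made document set and using only the stop words listed in this post (the function name and the sample documents are made up for illustration):

```python
import re
from collections import defaultdict

# Only the stop words named in the post; a real list would be far longer.
STOP_WORDS = {"is", "on", "or", "the", "at", "in", "how", "why"}

def build_index(documents):
    """Map each term to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word not in STOP_WORDS:
                index[word].append((doc_id, position))
    # Store the terms sorted alphabetically, as described above.
    return dict(sorted(index.items()))

documents = {
    1: "The spider is on the web",
    2: "Why the web spider crawls",
}
index = build_index(documents)
```

Looking up a term is then a single dictionary access, for example `index["spider"]`, instead of a scan through every document.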
When a user enters a search term into Google, the request is sent to the indexing servers to find out whether the term exists in the database. At this point it is important to realize that, because of the amount of data Google holds, it would be too difficult to store all the information on one indexing server. That’s why Google uses many separate servers, each holding some part of the data. The query is therefore sent to different servers simultaneously, and if the term exists in the database Google generates the 1000 most relevant results based on more than 100 factors (such as PageRank™, meta tags, the age of a website, traffic on the website and many more).
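A rough Python sketch of querying several index servers and keeping only the best results. Everything here is assumed for illustration: each “server” is just a dictionary, the single score per document stands in for the 100+ ranking factors, and the servers are asked one after another rather than simultaneously.

```python
import heapq

def search_shards(term, shards, limit=1000):
    """Ask every index server (shard) for `term` and keep the
    `limit` highest scoring hits.

    Each shard maps a term to a list of (doc_id, score) pairs; the
    score is a made-up stand-in for the combined ranking factors.
    """
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return heapq.nlargest(limit, hits, key=lambda hit: hit[1])

# Two hypothetical index servers, each holding part of the data.
shards = [
    {"seo": [(1, 0.9), (2, 0.4)]},
    {"seo": [(3, 0.7)], "blog": [(4, 0.5)]},
]
top_two = search_shards("seo", shards, limit=2)
```

If no server knows the term, the result is simply an empty list.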
At the same time, a special number is attached to each document. This Document-ID is then sent to file servers, where a title and description of the website are added based on its meta tags. If there are no meta tags, the title and description are generated automatically from the site’s content. In this case, too, many file servers work simultaneously.
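The “use the meta tags, otherwise fall back to the page content” rule can be sketched in Python with the standard `html.parser` module. The function name, the length limits and the fallback rule (take the first characters of the visible text) are my assumptions; the sketch also does not filter out script or style content.

```python
from html.parser import HTMLParser

class _PageSummary(HTMLParser):
    """Collect the <title>, the meta description and the visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

def summarize(html, max_length=160):
    """Return (title, description), falling back to the page text
    when the title or meta description is missing."""
    parser = _PageSummary()
    parser.feed(html)
    parser.close()
    body_text = " ".join(parser.text)
    title = parser.title.strip() or body_text[:60]
    description = parser.description.strip() or body_text[:max_length]
    return title, description

with_meta = summarize(
    '<html><head><title>My Site</title>'
    '<meta name="description" content="About spiders"></head>'
    '<body><p>Hello world</p></body></html>'
)
without_meta = summarize("<html><body><p>Hello world</p></body></html>")
```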
The last stage is adding advertisements relevant to the search term, taken from the ad servers, to the search results. The ad servers keep information about advertisers and campaigns, and they determine which advert should be published on the search results page. These ad servers are Google’s main source of income: they bring in 98% of the search engine’s overall revenue.
All this information is put together and displayed in the user’s web browser as a dynamically generated page. And all of that is done in seconds.
For more information about how Google works, take a look at these websites:
http://www.googleguide.com/google_works.html
http://www.portfolio.com/interactive-features/2007/08/google
http://www.portfolio.com/culture-lifestyle/goods/gadgets/2007/08/13/How-Google-Works
http://en.wikipedia.org/wiki/Index_%28search_engine%29
And last, but not least: what does Google think about SEO, and what do they suggest? http://www.google.com/support/webmasters/bin/answer.py?answer=35291
Pawel Szulencki is an SEO (Search Engine Optimization) and Marketing certified specialist who is interested in organic SEO, paid campaigns (PPC) and Social Media Marketing channels.
Make Money Online (1 comments.)
May 6th, 2009 at 4:07 am
Googlebots love blogs! Blogs are always refreshed with new content so they tend to crawl through blogs at a very frequent rate.
Dessy (1 comments.)
May 6th, 2009 at 10:47 pm
Thanks for the valuable info
Holiday (1 comments.)
May 7th, 2009 at 7:05 pm
So you mean if compared to the bots of other search engines - GoogleBot is the most advanced one?
scarface (1 comments.)
May 7th, 2009 at 9:46 pm
I always link to a new site from another site rather than adding its URL at http://www.google.com/addurl.html, but nice post, mate.
Promotional items (1 comments.)
May 8th, 2009 at 7:34 am
It’s a really informative post. Thanks for sharing it.
Sanjeev (1 comments.)
May 12th, 2009 at 12:49 pm
wow,
very nice information.
Seo company (1 comments.)
May 13th, 2009 at 12:44 pm
Very Nice article Pawel ,
Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.
The brief Wikipedia explanation is very useful for all of us.
Regards,
Pradeep
Eric-Directory Submission Service (1 comments.)
May 14th, 2009 at 5:25 pm
Very Nice Explanation. I will bookmark your site, to read more about google. Keep Posting more.
Pawel Szulencki (171 comments.)
May 16th, 2009 at 4:29 pm
@Seo company: Google follows the links on each website finding new pages that way. It also requests all pages on servers to find additional pages.
Artvisualizer (1 comments.)
June 2nd, 2009 at 2:39 pm
wow.. thanks for the info
Scentsy(new comment)
June 4th, 2009 at 10:50 pm
Thank you for clearing that up. I’ve always wondered what the difference between spiders and robots is.
ip locator(new comment)
June 10th, 2009 at 10:52 am
Google loves fresh content, and if you have a blog there are more ways to produce fresh content. You can see major sites moving towards fresh content and syndication.
thisispopup.com(new comment)
June 13th, 2009 at 8:48 pm
Very useful information. I always wondered how Google got an index of everybody’s website.
Stu | Lasik Surgeon(new comment)
June 16th, 2009 at 7:15 am
A good, concise explanation of how search engine spiders work.
Required reading for anyone interested in this kind of stuff, thanks!
Aneuk Nanggroe(new comment)
July 20th, 2009 at 12:06 pm
Thanks for the information! This is what I have been looking for.
Hand Dryers(new comment)
July 25th, 2009 at 1:37 pm
Nice article. Thanks for sharing it..:)
Aron H.(new comment)
August 3rd, 2009 at 9:43 am
Your info on the Google web crawler is really nice, and I was really astonished by it. Please write more about the latest on this and about the new Bing crawler.
thanks
Macky99(new comment)
August 4th, 2009 at 8:46 pm
Thanks for sharing this useful information.
Now it’s better to know how you are crawled.
gurgle parenting and pregnancy(new comment)
August 5th, 2009 at 6:05 pm
Top article. I’ve heard about PageRank sculpting to divert a bot around the right areas of your site. Does it work?
Preston web design(new comment)
August 10th, 2009 at 5:14 pm
Another comment praising you on your lovely post
I think the key is to have regularly updated fresh content. If the content is good, the rankings do come in time…we must be patient
Uzay(new comment)
August 27th, 2009 at 2:26 am
Nice article and info. Thank you for sharing.
Silver Platform Shoes(new comment)
September 21st, 2009 at 6:08 am
Interesting–pretty awesome what goes on for a search. How fast it comes back to you. Always wondered how it worked. You did a great job in taking it down to layman’s terms so anyone can understand it. Thank you.
investing(new comment)
September 24th, 2009 at 9:14 pm
very helpful tutorial for beginners like me.
Thx I will learn more about the systematic google index and crawler
Making Money for Mommies(new comment)
September 28th, 2009 at 5:03 am
This is such a great article! It really explains a lot about how Google crawls the web and finds web sites. I appreciate your writing this post. I will definitely be bookmarking your blog!
Home Lighting(new comment)
October 3rd, 2009 at 12:43 pm
I was really surprised to see this type of information available free of cost.
Email Marketing Solutions(new comment)
October 3rd, 2009 at 1:37 pm
Fantastic article, well written and extremely useful, Thanks for sharing. Keep posting!
Suneedh(new comment)
October 6th, 2009 at 11:33 pm
Great post. I hope there is some way to actually ask Google to crawl my site http://carpendium.com more often than it does now. I guess there is nothing like that yet. The more I post, the more the site is crawled. So, CONTENT IS KING.
Blastoff Network(new comment)
October 14th, 2009 at 1:39 am
Wow very informative. Learned a lot about how the bots work. Thanks for the in depth info on it. I guess now I just need to start writing more.
info about settlements(new comment)
October 18th, 2009 at 5:32 pm
@Blastoff Network: ha, tell me about it! This post is so much better than other sites, as they don’t describe everything. Thank you for sharing it with us.
Spinal Stenosis(new comment)
November 1st, 2009 at 12:57 am
If you add your website manually, it will take more time to index your pages. But if Googlebot finds your website through a link from another website, this will dramatically reduce the time it takes to get indexed.
Cottage Rental(new comment)
December 25th, 2009 at 10:42 am
Thanks for the valuable information. Very useful for newbies like me.
Florida SEO(new comment)
January 4th, 2010 at 8:18 am
Blogs were the best addition to the social media world ever! Google indexes blogs constantly because of unique content. Keep blogging!
Chad Timothy Nelson(new comment)
January 20th, 2010 at 4:34 am
This was a great introduction to the search engines as far as the googlebot goes. I had no idea there were so many rules about the crawler itself.
shiroi neko(new comment)
January 20th, 2010 at 10:30 am
yes google bot is really the best crawler. Thanks for sharing this information it will help a lot.