An overwhelming number of search engines on the
WWW, such as Google, AltaVista, Lycos and InfoSeek,
are spider-based. An understanding of how they work
can greatly help you get the best out of them.
Though the term "search engine" is
often used to describe all kinds of retrieval
tools, spider-based search engines differ considerably
from human-powered directories. We discussed human-powered
directories in the last issue; this week we take a
close look at spider-based search engines.
Unlike directory-type search engines,
spider-based search engines (also called crawlers,
robots or worms) seek out web pages by 'crawling'
through the WWW and automatically index sites
using their own indexing rules or algorithms.
Once you tell the search engine
what your URL is, its software robot will go there
automatically and index everything it needs.
How much it indexes, and to what degree, depends
upon its algorithm - a closely guarded secret
in many cases.
Parts of a Spider-Based Search Engine
Spider-based search engines have three
major elements:
- Spider
- Index
- Search
The spider or crawler, as its name
implies, crawls through the WWW, finds a web page,
reads it, and then follows links to other pages
within the site. It repeats this process at regular
intervals to check for new information or changes
in the page.
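The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the FAKE_WEB dictionary, the LinkExtractor class and the crawl function are hypothetical names, and the pages are held in memory rather than fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" (URL -> HTML). A real spider would fetch over HTTP.
FAKE_WEB = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/news.html": '<a href="/gone.html">Gone</a> '
                                    '<a href="http://other.example/">Elsewhere</a>',
    "http://other.example/": "A page on another site",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch=FAKE_WEB.get):
    """Visit pages breadth-first from start_url, staying within its site.

    Returns a dict mapping each page found to its HTML.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # the page does not exist, so skip it
            continue
        pages[url] = html  # "read" the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            # Follow only unseen links that stay within the same site
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


if __name__ == "__main__":
    for found in sorted(crawl("http://example.com/")):
        print(found)
```

Running crawl again at a later time and comparing the results is one simple way to picture the regular re-visits mentioned above.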
Information collected by the spider
goes into the second part of the search engine
- the index. The index is like a giant book containing
a copy of every web page that the spider finds.
If a web page changes, then this book is updated
with new information.
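As a rough sketch of the "giant book" idea, the Python class below keeps one copy of each page and replaces it only when the content has changed. The PageIndex class and its store method are hypothetical names chosen for illustration, and comparing content hashes is just one assumed way of detecting a change; real search engines do not publish how their indexes are built.

```python
import hashlib


class PageIndex:
    """Stores one copy of each page and replaces it when the content changes."""

    def __init__(self):
        self.copies = {}        # url -> latest copy of the page text
        self.fingerprints = {}  # url -> SHA-256 hash of that text

    def store(self, url, text):
        """Record text for url and return 'new', 'updated' or 'unchanged'."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        old = self.fingerprints.get(url)
        if old == digest:
            return "unchanged"  # same content as before, nothing to do
        self.copies[url] = text
        self.fingerprints[url] = digest
        return "new" if old is None else "updated"


if __name__ == "__main__":
    index = PageIndex()
    print(index.store("http://example.com/", "Welcome"))
    print(index.store("http://example.com/", "Welcome"))
    print(index.store("http://example.com/", "Welcome back"))
```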
The above two parts work in the background;
we only get to see the third part of a search
engine - the search software. This is a computer
program that sifts through the millions of pages
recorded in the index to find matches to a search
and ranks them in order of relevance. The order
of relevance is decided entirely by its own algorithm.
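To make the matching and ranking step concrete, here is a toy Python sketch. Real ranking algorithms are closely guarded, so the scoring rule used here (counting how often the query words appear on a page) is purely an assumption for illustration, and the names SAMPLE_INDEX, tokenize and search are hypothetical.

```python
import re
from collections import Counter

# Hypothetical index contents: URL -> page text
SAMPLE_INDEX = {
    "http://example.com/a": "A spider is a program. The spider follows links.",
    "http://example.com/b": "How a spider crawls the web",
    "http://example.com/c": "A directory compiled by human editors",
}


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(index, query):
    """Return URLs of pages containing every query word, best match first."""
    terms = tokenize(query)
    scored = []
    for url, text in index.items():
        counts = Counter(tokenize(text))
        if terms and all(counts[term] > 0 for term in terms):
            # Assumed relevance score: total occurrences of the query words
            score = sum(counts[term] for term in terms)
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by URL
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


if __name__ == "__main__":
    print(search(SAMPLE_INDEX, "spider"))
```

With this scoring rule, a page that mentions "spider" twice is listed ahead of one that mentions it once. A different algorithm could order the same pages differently, which is why results vary from one search engine to another.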
Features of Spider-Based Search Engines and Their Implications for Search Results
The ability of a spider to crawl through
millions of web pages and create an index without
human intervention makes it a very powerful search
tool with extremely broad coverage. Its second
ability, checking for changes or new information
in indexed pages by re-visiting them at regular
intervals and keeping the index up to date, again
without human intervention, is equally impressive.
However, the greatest strength of a
spider-based search engine is also its greatest
weakness. Broad coverage and the absence of human
editing mean that search results contain a significant
amount of junk or useless information. This is
particularly so when the search query is loosely worded.
The key to getting the best out of a spider-based
search engine is to understand some basics of
searching. In the coming issues we shall discuss a few
tips that can get you significantly better search
results.