There are millions of pages on the World Wide Web for every topic. Finding the most relevant ones is the biggest challenge that search engines face, yet modern search engines manage to do it remarkably well. So, how does a search engine work?

Websites contain unstructured information, and it is the task of a search engine to locate the best documents for a query. Modern search engines do this task almost perfectly. To do so they use many different techniques and algorithms. Each search engine uses its own set of algorithms and techniques, but there are some general steps that every search engine follows:
- Web Crawling
- Index Building
- Searching
Web Crawling
Search engine spiders, or bots, crawl the pages and websites available on the World Wide Web to gather information about their content. (Spiders/bots are software robots.) Crawling starts from a list of the most popular websites; the crawlers then follow every hyperlink those websites point to and repeat the process recursively for each new site they find. In this way spiders can cover the web at a very fast rate. While crawling, the bots build a list of the keywords used in the title, headings, meta tags, etc., and this information helps the search engine determine what a page is about.
Meta tags are special HTML tags that search engines use to get precise information about a web page. Website developers can use meta tags to improve the ranking of their pages.
Site owners can use a robots.txt file to exclude their website or specific pages from crawling.
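To make the crawling loop concrete, here is a minimal sketch in Python using only the standard library. The seed URL, page limit and parsing details are my own illustrative assumptions; real crawlers are distributed systems that are far more sophisticated.

# A minimal crawler sketch (illustrative only): start from a seed URL,
# respect robots.txt, collect the title and meta keywords, and follow links.
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.keywords, self.title = [], [], ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = attrs.get("content", "").split(",")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(seed, max_pages=20):
    seen, queue, collected = set(), [seed], {}
    while queue and len(collected) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        robots = urllib.robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
        try:
            robots.read()                      # honour robots.txt exclusions
            if not robots.can_fetch("*", url):
                continue
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        parser = PageParser()
        parser.feed(html)
        collected[url] = {"title": parser.title.strip(), "keywords": parser.keywords}
        queue.extend(urljoin(url, link) for link in parser.links)   # the recursive step
    return collected

# Example usage (example.com is just a placeholder seed):
# print(crawl("https://example.com/"))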
Indexing
The search engine stores and uses the information about each website sent back by the spiders to build an index. The information that the bots fetch from the meta tags, title, headings, text and pictures is collected, parsed and stored in a database. This index makes search results faster and more reliable. If the search engine did not index this information, then for every query from every user it would have to scan the whole World Wide Web for relevant results, which could take hours! In this way every document on the internet ends up with a set of keywords associated with it.
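As a rough sketch of the idea, here is a tiny in-memory inverted index in Python (the two sample documents are made up; a real index lives in a huge distributed database):

# For every keyword, remember which documents it appears in and how often,
# so a query never has to rescan the whole collection.
from collections import defaultdict

docs = {
    "page1": "search engines crawl the web and index pages",
    "page2": "an index makes search results fast and reliable",
}

inverted_index = defaultdict(dict)          # keyword -> {doc_id: count}
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word][doc_id] = inverted_index[word].get(doc_id, 0) + 1

print(inverted_index["index"])              # {'page1': 1, 'page2': 1}
print(inverted_index["search"])             # {'page1': 1, 'page2': 1}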
Google stores the information sent by the crawlers as well as a cached copy of each web page. It uses this cached copy to display the content when the page the bots indexed no longer exists or has been removed. This increases the reliability of Google's results.
Searching
Search engines use different algorithms and techniques very efficiently to return the most relevant results for a user's query, out of the millions of results associated with a keyword. Here I'm explaining some general rules that every search engine uses.
Term Frequency approach:
A page/document is ranked on the basis of the number of times a keyword occurs in it and the length of the document. This is how the earliest search engines worked.
Term Frequency, TF = log(1 + b/a)
b: number of times the keyword occurs in the document
a: total number of terms in the document
This result can be refined further using measures such as whether the keyword occurs in the title and at which position in the document it appears.
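The formula above translates directly into code; here is a small sketch (the sample sentence is made up, and real engines also weight where the keyword appears):

# Term frequency as defined above: TF = log(1 + b/a), where
# b = occurrences of the keyword and a = total terms in the document.
import math

def term_frequency(keyword, document):
    terms = document.lower().split()
    b = terms.count(keyword.lower())
    a = len(terms)
    return math.log(1 + b / a) if a else 0.0

print(term_frequency("search", "search engines make search fast"))   # ~0.34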
Term Frequency-Inverse Document Frequency:
To give more accurate results for multi-keyword queries, search engines started using the TF-IDF (term frequency-inverse document frequency) approach. A simple approach would be to sum up the TF of each keyword in the query, but not all keywords are equal: some are common while others are rare.
For example, take the query: programming techcodify
Here the term "programming" is far more common than "techcodify", so the search engine gives more weight to web pages that contain "techcodify" frequently than to pages that merely contain "programming". For this, the inverse document frequency is calculated as
IDF = 1/c
c: number of documents that contain the term among their keywords
So, the relevance ranking of a website is calculated as
TF-IDF score = sum of (TF × IDF) over each keyword in the query
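Here is a minimal sketch of that ranking rule, using the definitions above (TF = log(1 + b/a), IDF = 1/c) and a made-up three-page corpus; production systems use far more refined variants:

# Rank documents for a multi-keyword query by summing TF * IDF per keyword.
import math

docs = {
    "page1": "programming tutorials and programming examples",
    "page2": "techcodify programming tutorial",
    "page3": "cooking recipes and kitchen tips",
}

def tf(keyword, text):
    terms = text.lower().split()
    return math.log(1 + terms.count(keyword) / len(terms))

def idf(keyword):
    c = sum(1 for text in docs.values() if keyword in text.lower().split())
    return 1 / c if c else 0.0

def score(query, text):
    return sum(tf(k, text) * idf(k) for k in query.lower().split())

query = "programming techcodify"
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)   # page2 first: 'techcodify' is rarer, so it carries more weight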
Another factor that is considered is proximity, i.e. how close the keywords of the query occur to each other on the web page. A page where the keywords appear closer together is ranked higher. Keywords like "it", "then", "this", "or", etc. are given almost no weight because their IDF is extremely low, since nearly every document contains them.
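One way to sketch proximity is to score a page by the smallest gap (in words) between two query keywords; this exact measure is my own simplification, since search engines don't publish their formulas:

# A crude proximity score: the closer two query keywords appear together,
# the higher the score (1 / minimum word distance).
def proximity_score(word_a, word_b, text):
    terms = text.lower().split()
    pos_a = [i for i, t in enumerate(terms) if t == word_a]
    pos_b = [i for i, t in enumerate(terms) if t == word_b]
    if not pos_a or not pos_b:
        return 0.0
    gap = min(abs(i - j) for i in pos_a for j in pos_b)
    return 1.0 / gap if gap else 1.0   # gap can only be 0 if both words are the same

print(proximity_score("programming", "techcodify",
                      "techcodify programming tutorials"))              # 1.0
print(proximity_score("programming", "techcodify",
                      "programming tips and then later techcodify"))    # 0.2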
Later, many websites exploited this approach by repeating a keyword over and over, gaining a higher ranking without actually having relevant information.
Earlier search engines relied only on TF-IDF factors, but modern search engines like Google, Yahoo and Bing use many other factors as well.
Popularity Ranking or relevance using Hyperlinks:
A website's ranking or relevance is also decided on the basis of its popularity, i.e. how many other websites link to it and what their own rankings are. This approach may not sound very precise, but a search engine can't simply trust a new or obscure website, so it is a reasonable measure. Nowadays this is the most important factor determining the ranking of a website.
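Here is a minimal sketch of link-based popularity in the spirit of PageRank; the four-page link graph and the damping factor are my own illustrative choices, not any engine's real parameters:

# Iteratively give each page a score based on the scores of the pages
# that link to it (a simplified PageRank-style computation).
links = {                                  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

rank = {page: 1.0 / len(links) for page in links}
damping = 0.85

for _ in range(20):                        # repeat until the scores stabilise
    new_rank = {}
    for page in links:
        incoming = sum(rank[p] / len(out) for p, out in links.items() if page in out)
        new_rank[page] = (1 - damping) / len(links) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # "C" ends up most popular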
This is how a search engine works. So, the next time you search for something, keep in mind that a lot is happening behind the scenes.
Did you like this tutorial? Comment and let us know.
Thank you!