
January 22, 2014

How a Search Engine Works


There are millions of pages on the World Wide Web for any given topic. Retrieving the most relevant ones is the biggest challenge that search engines face, yet modern search engines manage to do it remarkably well. So, how does a search engine work?


Websites contain unstructured information, and it is the task of a search engine to locate the best documents for a query. Modern search engines do this almost perfectly, using many different techniques and algorithms. Each search engine uses its own set of algorithms and techniques, but there are some general steps that every search engine follows:
  • Web Crawling
  • Index Building
  • Searching


Web Crawling

Search engine spiders or bots (software robots) crawl the pages and websites available on the World Wide Web to gather information about their content. Crawling starts from a list of popular websites; the crawler follows every hyperlink on those sites and repeats the process recursively for each page it finds, which lets spiders cover the web at great speed. While crawling, the bots build a list of the keywords used in the title, headings, meta tags, and so on, and this information helps the search engine judge what each page is about.
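As a rough sketch (not any real engine's crawler), the breadth-first idea looks like this in Python, using only the standard library; the seed list and page limit are made-up placeholders:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href value of every anchor tag on a page
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    # Breadth-first crawl starting from a seed list of popular sites
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages  # url -> raw HTML, ready for indexing

A real crawler also respects robots.txt, runs many fetches in parallel, and filters out non-HTML content, so treat this only as the skeleton of the idea.
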
Meta tags are special HTML tags that give search engines precise information about a web page. Website developers can use meta tags to improve how their pages are ranked.
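For instance, the head of a page might carry tags like these (the values are invented for illustration):

<head>
  <title>How a Search Engine Works</title>
  <meta name="description" content="An overview of crawling, indexing and ranking">
  <meta name="keywords" content="search engine, crawler, index, ranking">
</head>
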
Site owners can use a robots.txt file to exclude their website or specific pages from crawling.
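For example, a robots.txt file placed at the root of a site might look like this (the paths are hypothetical); well-behaved bots skip the disallowed paths:

User-agent: *
Disallow: /private/
Disallow: /drafts/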

Indexing

The search engine stores and uses the information sent back by the spiders to build an index. The information that the bots fetch from the meta tags, title, headings, text, and pictures is collected, parsed, and stored in a database. This index makes search results faster and more reliable: if search engines did not index this information, then for every query from every user they would need to scan the whole World Wide Web for relevant results, which could take hours. In this way, every document on the internet has a set of keywords associated with it.
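The usual data structure behind this is an inverted index: a map from each keyword to the documents that contain it. A toy Python version (real indexes also store word positions, fields, and ranking data) might look like this:

from collections import defaultdict

def build_index(pages):
    # Map each word to the set of pages it appears on
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Answering a query is now a fast lookup instead of a scan of the whole web
index = build_index({"a.example": "python programming tutorial",
                     "b.example": "cooking recipes"})
print(index["programming"])  # {'a.example'}
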
Google stores the information sent by the crawlers as well as a cached copy of each web page. It uses this cached copy to display content even when the page the bots fetched no longer exists or has been removed. This increases Google's reliability.

Searching

Search engines use different algorithms and techniques to pick the most relevant results for a user's query out of the millions of documents that match a keyword. Here are some general approaches that every search engine uses in some form.

Term Frequency approach:
A page or document is ranked on the basis of the number of times a keyword occurs in it and the length of the document. This is how the earliest search engines ranked results.

Term Frequency, TF = log(1 + b/a)

b: number of times the keyword occurs
a: number of terms in the document

This score can be refined further with signals such as whether a keyword occurs in the title and where in the document it first appears.
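The formula translates directly to Python; splitting on whitespace is a crude stand-in for real tokenization:

import math

def term_frequency(keyword, document):
    # TF = log(1 + b/a): b = keyword occurrences, a = document length in terms
    words = document.lower().split()
    b = words.count(keyword.lower())
    a = len(words)
    return math.log(1 + b / a) if a else 0.0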

Term Frequency-Inverse Document Frequency:
To give more accurate results for multi-keyword queries, search engines started using the TF-IDF (term frequency-inverse document frequency) approach. A simple approach would be to sum up the TF of each keyword in the query, but not all keywords are equal: some are common, while others are rare.

For example, consider the query: Programming techcodify
Here the term Programming is much more common than techcodify, so the search engine will give more weight to web pages that contain techcodify than to pages that merely contain programming frequently. For this, the inverse document frequency is calculated as

IDF = 1/c

c: number of documents that contain the term
So, the relevance ranking of a page is calculated as

TF-IDF = sum of each keyword's (TF × IDF) in the query
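
Reusing term_frequency from above and the index built by build_index in the indexing section, these formulas become:

def inverse_document_frequency(keyword, index):
    # IDF = 1/c: c = number of documents containing the keyword
    c = len(index.get(keyword.lower(), ()))
    return 1.0 / c if c else 0.0

def tf_idf_score(query, document, index):
    # Relevance = sum of TF * IDF over every keyword in the query
    return sum(term_frequency(kw, document) * inverse_document_frequency(kw, index)
               for kw in query.split())

Real engines usually dampen IDF with a logarithm, such as log(N/c); the 1/c form above simply follows this post's formula.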

Another factor that is considered is proximity, i.e. how close the keywords of the query occur to each other on the web page; a page where the keywords appear closer together is ranked higher. Common words like it, then, this, and or are given no weight, since their IDF is extremely low: nearly every document contains them.
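
A toy proximity measure (distance counted in words, which is just one plausible choice) could be:

def proximity(document, kw1, kw2):
    # Minimum word distance between two keywords; smaller means more relevant
    words = document.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == kw1]
    pos2 = [i for i, w in enumerate(words) if w == kw2]
    if not pos1 or not pos2:
        return float("inf")  # one keyword is missing entirely
    return min(abs(i - j) for i in pos1 for j in pos2)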

Later, many websites exploited this approach by repeating a keyword over and over, earning a high ranking even without relevant content.
Early search engines relied only on TF-IDF factors, but modern search engines like Google, Yahoo, and Bing use many other factors.

Popularity Ranking or relevance using Hyperlinks:
A website's ranking or relevance is decided on the basis of its popularity: how many websites link to it, and what the ranking of those linking sites is. This approach may not sound very precise, but a search engine cannot simply trust a new or obscure website, so it is reasonable. Nowadays this is the most important factor in determining a website's ranking.
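
Google's famous version of this idea is PageRank. The sketch below is a much-simplified power-iteration illustration, not the real algorithm, and the link graph is invented:

def popularity_rank(links, iterations=20, damping=0.85):
    # Each page repeatedly shares its score among the pages it links to,
    # so a link from a high-ranking page is worth more than one from a low-ranking page
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            targets = [t for t in outgoing if t in rank]
            if not targets:
                continue  # dangling page; a simplification of the real algorithm
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a.example": ["b.example", "c.example"],
         "b.example": ["c.example"],
         "c.example": ["a.example"]}
print(popularity_rank(graph))  # c.example ranks highest: two pages link to it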

This is how a search engine works. So, next time you search for something, just keep in mind that a lot is happening behind the scenes.

Did you like this tutorial? Do comment and let us know.
