What Do Search Engines Actually Do? [Crawling, Indexing] Explained

Imagine that every webpage is a human. Then the web would be a world made up of 60 trillion humans. And imagine that everyone here is creative: someone is writing, someone is making videos, someone is taking pictures. Everyone has something to offer.

Now suppose you want a burger video. How can you find a burger video among 60 trillion humans? That is what a search engine does.

A search engine collects information from all 60 trillion humans, analyzes it, and shows you the best possible burger video.

The first task is crawling. So what is crawling? Let me explain:


Crawling & Bot:

In simple terms, crawling means following something. Every search engine has a bot (robot). This bot is just a program that runs an algorithm.

The bot travels from page to page by following links and collects information about each webpage. When the text of a link contains a keyword, the bot takes that as a hint that the linked page is related to that keyword.
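To make the link-following idea concrete, here is a minimal sketch in Python. It is not how any real search engine's bot is written. The three-page FAKE_WEB dictionary, the LinkExtractor class and the crawl function are all made up for illustration, and the pages live in memory so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from the <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, link_text) pairs
        self._href = None    # href of the <a> tag we are currently inside
        self._text = []      # text pieces seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical three-page "web", kept in memory (assumption for the example).
FAKE_WEB = {
    "/home": '<a href="/burger">best burger video</a> <a href="/about">about us</a>',
    "/burger": '<a href="/home">home</a>',
    "/about": '<a href="/home">home</a>',
}


def crawl(start):
    """Breadth-first crawl: visit each page once by following its links."""
    seen, queue = {start}, deque([start])
    order, link_texts = [], {}
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(FAKE_WEB.get(url, ""))
        for href, text in parser.links:
            # Remember the link text as a keyword hint about the target page.
            link_texts.setdefault(href, []).append(text)
            if href in FAKE_WEB and href not in seen:
                seen.add(href)
                queue.append(href)
    return order, link_texts


order, link_texts = crawl("/home")
print(order)                   # pages in the order the bot visited them
print(link_texts["/burger"])   # keyword hints collected for /burger
```

The bot starts at /home, discovers /burger and /about through the links, and remembers that /burger was linked with the words "best burger video".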

The bot is just a traveller: it goes from webpage to webpage, collects information about each page, and sends the data back to the search engine. The search engine then uses a separate algorithm to analyze this information and find the best possible result for the visitor.
This analysis and algorithmic calculation is called indexing.
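One common way to picture the analysis step is an inverted index, which maps each word to the pages that contain it. The sketch below is only an illustration under that assumption. The PAGES data and the build_index and search functions are hypothetical, and real search engines use far more signals than word matching.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> text the bot collected.
PAGES = {
    "/burger-video": "How to grill the best burger at home",
    "/pizza-video": "Best pizza dough recipe",
    "/burger-photo": "A photo of a cheese burger",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


index = build_index(PAGES)
print(search(index, "burger"))
print(search(index, "best burger"))
```

Searching for "burger" returns both burger pages, while "best burger" narrows the result to the one page that contains both words.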

A site manager or developer can tell bots which parts of the site they may or may not crawl by placing a robots.txt file in the site's root directory.

Example: https://www.geek.com.bd/robots.txt
To learn more about robots.txt, see: How robots.txt works and how to optimize it.

If the developer does not include a robots.txt file, bots can still crawl the site to collect information.
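As a small illustration, Python's standard library includes urllib.robotparser, which reads robots.txt rules the way a polite bot would. The rules, the "ExampleBot" name and the example.com URLs below are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at <site root>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules from memory, no network

# "ExampleBot" is a made-up crawler name used only for this example.
blog_allowed = parser.can_fetch("ExampleBot", "https://example.com/blog/post-1")
admin_allowed = parser.can_fetch("ExampleBot", "https://example.com/admin/settings")

print(blog_allowed)   # the blog page is not covered by any Disallow rule
print(admin_allowed)  # the admin page falls under Disallow: /admin/
```

Keep in mind that robots.txt is a request, not a lock: well-behaved bots respect it, but nothing forces a bot to do so.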

Making your site easy to crawl is good for SEO.

Indexing:

In layman's terms, indexing is the process of adding webpages to Google search.
Whether a page gets indexed depends partly on its robots meta tag. The tag can carry one of two values: index or noindex.

Simply put, index asks the search engine to index the page, and noindex asks it not to. The search engine checks which value you used (index or noindex) and then decides whether to index the page.

By default, every WordPress and Blogger site requests indexing. If you want better SEO results, make sure your website is set to index rather than noindex.
Do not index unnecessary archives such as tag pages, category pages, and other useless pages.
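For reference, the tag usually looks like <meta name="robots" content="noindex"> inside the page's head. The sketch below shows one way a program could check for it with Python's built-in HTML parser. The RobotsMetaParser class and is_indexable function are hypothetical names, and the check only looks at noindex, not at other directives a search engine may understand.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html):
    """True unless the page asks not to be indexed via a robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


print(is_indexable('<head><meta name="robots" content="index, follow"></head>'))
print(is_indexable('<head><meta name="robots" content="noindex"></head>'))
print(is_indexable("<head><title>No robots meta tag</title></head>"))
```

A page without any robots meta tag is treated as indexable here, which matches the "index by default" behaviour described above.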

Factors That May Affect Crawling

Positive Factors:

  • High-Quality Content (What is high quality, what is low-quality content?)
  • Less HTML, More Text
  • Fast Web Pages
  • Avoiding Unnecessary Linking
  • Good Domain Name
  • Backlinks from Good Sites
  • Optimized Internal Linking
  • A Well-Optimized Sitemap
  • Pinging

Things to avoid for better crawling: 

  • Slow website 
  • 5XX Errors
  • Duplicate content within your site
  • Low-Quality pages
  • Spammy pages (an unnaturally high number of backlinks)
  • Long redirect chains
  • Long page-load times that may time out
  • Nonstrategic use of noindex and nofollow tags
  • Pages served up through AJAX without links in the page source
  • Blocking bots from crawling JavaScript and CSS files
  • “Dirt” in your sitemap
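Some of the items above can be checked with a few lines of code. As one example, here is a rough sketch for spotting long redirect chains and redirect loops. The REDIRECTS table, the redirect_chain function and the limit of 5 hops are all assumptions made for this illustration, not values published by any search engine.

```python
# Hypothetical redirect table: source URL -> URL it redirects to.
REDIRECTS = {
    "/old": "/older",
    "/older": "/oldest",
    "/oldest": "/new",
    "/loop-a": "/loop-b",
    "/loop-b": "/loop-a",
}


def redirect_chain(url, redirects, max_hops=5):
    """Follow redirects starting at url and return (chain, status).

    status is "ok", "too_long" (more than max_hops redirects) or "loop".
    max_hops=5 is an arbitrary limit chosen for this example.
    """
    chain = [url]
    while chain[-1] in redirects:
        nxt = redirects[chain[-1]]
        if nxt in chain:
            return chain + [nxt], "loop"
        chain.append(nxt)
        if len(chain) - 1 > max_hops:
            return chain, "too_long"
    return chain, "ok"


print(redirect_chain("/old", REDIRECTS))
print(redirect_chain("/old", REDIRECTS, max_hops=2))
print(redirect_chain("/loop-a", REDIRECTS))
```

Every extra hop costs the bot another request, so pointing /old straight at /new is kinder to crawlers than a three-step chain.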



Some Famous Crawlers/Bots/Robots:
  • Applebot
  • Baiduspider (for the Baidu search engine)
  • Bingbot
  • GoogleBot
  • ia_archiver
  • MSNbot (For MSN Search Engine)
  • Naverbot
  • Seznambot
  • Slurp
  • teoma
  • Twitterbot
  • Yandex (for Russian Yandex search engine)
  • Yeti
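Bots usually announce themselves in the User-Agent header of each request, so a site can get a rough idea of who is visiting. The sketch below assumes that each bot's name appears somewhere in that header. The KNOWN_BOTS table and identify_bot function are hypothetical, and short names such as Yeti and teoma are left out because plain substring matching on them could give false matches.

```python
# Lowercase substrings assumed to appear in each crawler's User-Agent header.
KNOWN_BOTS = {
    "googlebot": "GoogleBot",
    "bingbot": "Bingbot",
    "baiduspider": "Baiduspider",
    "yandex": "Yandex",
    "applebot": "Applebot",
    "twitterbot": "Twitterbot",
    "seznambot": "Seznambot",
    "slurp": "Slurp",
    "ia_archiver": "ia_archiver",
    "msnbot": "MSNbot",
}


def identify_bot(user_agent):
    """Return the name of the first known bot found in the header, else None."""
    ua = user_agent.lower()
    for token, name in KNOWN_BOTS.items():
        if token in ua:
            return name
    return None


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

print(identify_bot(googlebot_ua))
print(identify_bot(browser_ua))
```

A User-Agent header is easy to fake, so treat this as a hint for your logs rather than proof that a visitor really is that bot.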

Limitations of Search Engine Technology: 

The major search engines have powerful bots/crawlers. They use artificial intelligence, BigQuery, big data, and perhaps technology far more advanced than these. But they still have some limitations.

1. In-site duplicate content: Crawlers are always looking for new content. If they find duplicate content, they grade it negatively. But modern websites have more complex features, and it is not always possible to provide unique content for every page of the same type.

2. Online forms: If a bot encounters an online form, such as a login form, it often cannot crawl the content behind it.

3. High sensitivity to robots.txt and sitemaps: Crawlers are very sensitive to robots.txt. A misconfiguration in robots.txt may affect the website's entire Google presence. Sometimes a misconfigured sitemap can also degrade the site's Google presence or the crawler's access.

4. Problems matching queries to content: There is no difference in meaning between 'color' and 'colour'; one is the American spelling and the other is the British. But Google may treat those keywords in different ways, and that is where the problem begins. When you search for 'color', Google may consider only the articles that contain the keyword 'color'. But what we are actually looking for is the best content related to both 'color' and 'colour'.
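To illustrate the last point, one simple idea is to map spelling variants to a single form before comparing a query with a page. The sketch below is only that idea in miniature. The SPELLING_VARIANTS table and the normalize and matches functions are hypothetical, and this is not a description of how Google actually handles such queries.

```python
import re

# Hypothetical, tiny variant table: British spelling -> American spelling.
SPELLING_VARIANTS = {
    "colour": "color",
    "favourite": "favorite",
    "centre": "center",
}


def normalize(text):
    """Lowercase the text and map known spelling variants to one form."""
    words = re.findall(r"[a-z]+", text.lower())
    return [SPELLING_VARIANTS.get(word, word) for word in words]


def matches(query, document):
    """True if every normalized query word appears in the document."""
    doc_words = set(normalize(document))
    return all(word in doc_words for word in normalize(query))


print(matches("color", "Choosing a colour palette"))
print(matches("colour", "Color theory basics"))
print(matches("color", "Burger recipes"))
```

With the variants normalized, a search for 'color' finds the 'colour' article and the other way round, while unrelated pages still do not match.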

Copyright © Geek Bangladesh. Designed by OddThemes