How Search Engines Work

Search engines store information about billions of web pages, allowing users to search for the websites most relevant to their query rather than having to remember every website themselves. Search engines also allow users to find websites they would not normally come across. A search engine will typically operate in the following way:

1. Web Crawling – Crawlers, also known as spiders, are robots that access websites, review them and then send the information back to the search engine. Spiders review a website against predetermined criteria set by the search engine's algorithm. This indicates to the search engine whether your website is credible and ethical.

2. Indexing – The information passed back by the crawler is stored in the search engine's index. Once a website has been evaluated, it can be shown within the Search Engine Result Pages (SERPs) according to its credibility. Websites are generally shown in order of compliance and credibility as judged by the algorithm.

3. Searching – The search engine then displays the websites it has crawled and indexed for users to find. Search engines tend to rank websites in order of credibility as measured against the algorithm. The result pages are updated regularly as new websites are created and old ones are updated. Search engines know just as well as users do that industries change and information can become outdated rather quickly.
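The three steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the `PAGES` dictionary stands in for the web, the URLs are made up, and counting word occurrences is an assumed stand-in for a real ranking algorithm.

```python
from collections import defaultdict

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "https://example.com/": (
        "welcome to our search engine guide",
        ["https://example.com/crawling", "https://example.com/ranking"],
    ),
    "https://example.com/crawling": (
        "a crawler visits pages and follows links",
        ["https://example.com/"],
    ),
    "https://example.com/ranking": (
        "ranking orders search results by relevance to the search query",
        [],
    ),
}

def crawl(start_url):
    """Step 1: visit pages by following links, returning URL -> text."""
    seen, queue, fetched = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        fetched[url] = text
        queue.extend(links)
    return fetched

def build_index(fetched):
    """Step 2: build an inverted index, word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in fetched.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Step 3: score each URL by how often the query words occur on it."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

index = build_index(crawl("https://example.com/"))
results = search(index, "search ranking")
print(results)
```

Here the ranking page comes first because it contains the query words more often than the home page, and the crawling page is left out because it contains neither word.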

Webmasters have very limited means of communicating with search engines and their crawlers; however, there are several files you can upload to your website to guide a crawler or search engine ethically. A robots.txt file indicates to the crawler which parts of the website it may and may not crawl. Pages a webmaster may want to restrict include password pages, payment pages and sections of the website that a typical user will not find valuable. There are other communication methods, such as a sitemap. A sitemap is a file uploaded to the website that lists all of its web pages, which allows the crawler to browse the website more efficiently. More recent additions to these communication methods are supported by search engines such as Google: “nofollow” (which tells a crawler not to follow a link), rel canonical (which tells a search engine that this is the best URL for accessing a particular piece of information) and meta data (the information displayed in the SERPs, along with directives such as language, noodp and noindex).
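As a rough illustration of how a crawler might read these instructions, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs and the `ExampleBot` crawler name are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login/
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules permit this (made-up) crawler to fetch url."""
    return parser.can_fetch(user_agent, url)

blog_ok = allowed("https://example.com/blog/first-post")
login_ok = allowed("https://example.com/login/reset")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(blog_ok, login_ok, sitemaps)
```

A well-behaved crawler would skip the login and checkout sections and could use the listed sitemap to discover the remaining pages.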

Some search engines, such as Google, store all or part of a web page's source (referred to as a cache) as well as information about the web page, whereas search engines such as AltaVista store every word of every page they find. The cached page always holds the actual search text, since it is the version that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms no longer appear in it. A further benefit of caching is that old web pages may contain data that is no longer available elsewhere.
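The effect of caching can be shown with a very small sketch. The `cache` dictionary, the function names and the page text are all hypothetical; a real search engine stores far more than a plain string per page.

```python
# Hypothetical store: URL -> snapshot of the text taken at indexing time.
cache = {}

def index_page(url, text):
    """Keep a snapshot of the page text exactly as it was indexed."""
    cache[url] = text

def matches(url, term):
    """Check the term against the cached snapshot, not the live page."""
    return term.lower() in cache.get(url, "").lower().split()

url = "https://example.com/prices"
index_page(url, "Winter sale prices and offers")

# The live page is later rewritten, but the index has not been refreshed.
live_text = "Our current catalogue"

still_matches = matches(url, "sale")
in_live_page = "sale" in live_text.lower().split()

print(still_matches, in_live_page)
```

The search term still matches the cached snapshot even though it has disappeared from the live page, which is why the cached copy is the one that explains the match.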

When a user visits a search engine, they usually enter a keyword or key phrase into the query box. The search engine then looks through its index of websites and returns the best-matched websites in order of relevance. A SERP will usually show a website's meta page title, meta description and URL. Search engines such as Google allow users to remove certain URLs from the SERP according to their preferences. Some search engines still offer an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, which uses statistical analysis of pages containing the words or phrases you search for. Another approach is natural language search, where the user enters a question as they would ask it of a person; a good example of this is Ask.com.
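Proximity search is easy to illustrate. The sketch below assumes that "distance" means the difference in word positions between two keywords; the function name and sample text are made up, and real search engines may define and implement proximity differently.

```python
def within_proximity(text, first, second, max_distance):
    """Return True if `first` and `second` occur within max_distance words."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    # Compare every occurrence of one keyword with every occurrence of the other.
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

doc = "search engines rank pages so that users find relevant pages quickly"

near = within_proximity(doc, "rank", "pages", 1)
far = within_proximity(doc, "search", "quickly", 3)

print(near, far)
```

The first query matches because "rank" and "pages" are adjacent, while the second does not because "search" and "quickly" are ten words apart.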

The core foundation of a search engine is the quality of the results it returns to the user. There are other factors, such as loading speed and usability; however, result relevance is currently one of the main ones.