Google runs on a distributed network of thousands of low-cost
computers and can therefore carry out fast parallel processing. Parallel
processing is a method of computation in which many calculations are
performed simultaneously, significantly speeding up data processing. But the
important thing to know about how the Google search engine works is the
algorithm used in its system, called the “PageRank algorithm.” This is the
software behind Google’s search technology, and it conducts a series of
simultaneous calculations that require only a fraction of a second to
complete. Traditional search engines rely only on how often a word appears
on a web page. Through the use of the PageRank algorithm, Google examines
the entire link structure of the web and determines which pages are most
important. It then conducts hypertext-matching analysis to determine which
pages are relevant to the specific search being conducted. By combining
overall importance and query-specific relevance, it is able to put the most
relevant and reliable results first (The Google Search Engine and Google Graphic).
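To picture that combining step, here is a purely illustrative sketch in Python: a page’s query-independent importance (its PageRank) is merged with its query-specific relevance into a single ranking score. The multiplication and the sample numbers are my assumptions, since Google does not publish its actual formula.

    # Purely illustrative: combine query-independent importance (PageRank)
    # with query-specific relevance into one ranking score. The multiplication
    # and the sample numbers are assumptions, not Google's secret formula.
    def combined_score(pagerank, relevance):
        return pagerank * relevance

    pages = {
        "page_a": (0.9, 0.2),  # very important, but barely relevant to the query
        "page_b": (0.3, 0.8),  # less important, but highly relevant
    }
    ranked = sorted(pages, key=lambda p: combined_score(*pages[p]), reverse=True)
    print(ranked)  # ['page_b', 'page_a'] -- relevance can outweigh raw importance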
The key fact about the PageRank algorithm is that it
helps rank web pages that match a given search string. Instead of counting
keywords alone, the PageRank algorithm analyzes human-generated links,
assuming that web pages linked from many important pages are themselves
likely to be important. The algorithm computes a recursive score for pages,
based on the weighted sum of the PageRanks of the pages linking to them.
PageRank is thought to correlate well with human concepts of importance. A
web page’s PageRank depends on a few factors. One is the frequency and
location of keywords within the page: if a keyword appears only once within
the body of a page, the page receives a low score for that keyword. The
second is how long the page has existed: people create new web pages every
day, and not all of them stick around for long, so Google places more value
on pages with an established history. The last is the number of other web
pages that link to the page in question: Google looks at how many pages link
to a particular site to determine its relevance (CyberAge Books, 2001).
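To make the recursive scoring concrete, here is a minimal PageRank sketch in Python. The damping factor of 0.85 and the tiny four-page graph are illustrative assumptions, not Google’s actual parameters.

    # A minimal PageRank sketch: each page's score is a weighted sum of the
    # scores of the pages linking to it, recomputed until it stabilizes.
    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}  # start from a uniform score
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue  # pages with no outlinks cast no votes in this sketch
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share  # weighted vote from the linking page
            rank = new_rank
        return rank

    # Hypothetical four-page web: pages linked from important pages score higher.
    links = {
        "home": ["about", "blog"],
        "about": ["home"],
        "blog": ["home", "about"],
        "spam": ["home"],  # links out, but nothing links to it
    }
    print(pagerank(links))

Running this shows “home” accumulating the highest score because well-scored pages link to it, while “spam,” which nothing links to, stays near the minimum.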
The Google search engine has three distinct parts. The first is
Googlebot, a web crawler that finds and fetches web pages. The second is the
indexer, which sorts every word on every page and stores the resulting index
of words in a huge database. The last is the query processor, which compares
your search query to the index and recommends the documents it considers
most relevant (CyberAge Books, 2001).
Googlebot, Google’s Web Crawler
Googlebot
is Google’s web crawling robot, which finds and retrieves pages on the web and
hands them off to the Google indexer. It’s easy to imagine Googlebot as a
little spider scurrying across the strands of cyberspace, but in reality
Googlebot doesn’t traverse the web at all. It functions much like your web
browser, by sending a request to a web server for a web page, downloading the
entire page, then handing it off to Google’s indexer. Googlebot consists of
many computers requesting and fetching pages much more quickly than you can
with your web browser. In fact, Googlebot can request thousands of different
pages simultaneously. To avoid overwhelming web servers, or crowding out
requests from human users, Googlebot deliberately makes requests of each
individual web server more slowly than it’s capable of doing. Googlebot finds
pages in two ways: through an add URL form and through finding links by
crawling the web (Copyright © 2003 Google Inc.).
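The request-and-hand-off step can be pictured with Python’s standard library. The URL and the hand_to_indexer function below are hypothetical stand-ins, not anything Googlebot actually runs.

    # A toy version of the fetch step: request a page the way a browser would,
    # download the entire response body, then hand it off for indexing.
    from urllib.request import Request, urlopen

    def fetch(url):
        # Real crawlers identify themselves; Googlebot sends its own User-Agent.
        request = Request(url, headers={"User-Agent": "toy-crawler/0.1"})
        with urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")

    def hand_to_indexer(url, html):  # hypothetical stand-in for Google's indexer
        print(f"indexing {url}: {len(html)} characters")

    url = "https://example.com/"
    hand_to_indexer(url, fetch(url))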
When Googlebot fetches a page, it culls all the links appearing on it and
adds them to a queue for subsequent crawling. Googlebot tends to encounter little
spam because most web authors link only to what they believe are high-quality
pages. By harvesting links from every page it encounters, Googlebot can quickly
build a list of links that can cover broad reaches of the web. This technique,
known as deep crawling, also allows Googlebot to probe deep within individual
sites. Because of their massive scale, deep crawls can reach almost every page
in the web. Because the web is vast, this can take some time, so some pages may
be crawled only once a month. Although its function is simple, Googlebot must
be programmed to handle several challenges. First, since Googlebot sends out
simultaneous requests for thousands of pages, the queue of “visit
soon” URLs must be constantly examined and compared with URLs already in
Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot
from fetching the same page again. Googlebot must determine how often to
revisit a page. On the one hand, it’s a waste of resources to re-index an
unchanged page. On the other hand, Google wants to re-index changed pages to
deliver up-to-date results (Copyright © 2003 Google Inc.).
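The queue-and-deduplication logic described above might look like this in miniature. The fetch and extract_links parameters are hypothetical stand-ins, and the seen set stands in for comparing URLs against Google’s existing index.

    # A miniature "visit soon" queue with deduplication: links harvested from
    # each fetched page are queued once and never fetched twice.
    from collections import deque

    def crawl(seed_urls, fetch, extract_links, max_pages=100):
        queue = deque(seed_urls)
        seen = set(seed_urls)  # stands in for checking Google's existing index
        pages = {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            html = fetch(url)
            pages[url] = html
            for link in extract_links(html):
                if link not in seen:  # eliminate duplicates before queuing
                    seen.add(link)
                    queue.append(link)
        return pages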
Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds. These
pages are stored in Google’s index database. This index is sorted
alphabetically by search term, with each index entry storing a list of
documents in which the term appears and the location within the text where it
occurs. This data structure allows rapid access to documents that contain user
query terms. To improve search performance, Google ignores common words called stop words such as the, is, on,
or, of, how, why,
as well as certain single digits and single letters. Stop words are so common
that they do little to narrow a search, and therefore they can safely be
discarded. The indexer also ignores some punctuation and multiple spaces, as
well as converting all letters to lowercase, to improve Google’s performance (Copyright © 2003 Google Inc.).
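A toy version of that data structure is easy to sketch: an inverted index mapping each term to the documents, and word positions within them, where it occurs, with stop words dropped, punctuation stripped, and everything lowercased. The stop-word list below contains just the examples named in the paragraph.

    # A toy inverted index: term -> {document -> [word positions]}.
    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}

    def build_index(documents):
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in documents.items():
            words = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, drop punctuation
            for position, word in enumerate(words):
                if word not in STOP_WORDS:  # stop words do little to narrow a search
                    index[word][doc_id].append(position)
        return index

    docs = {
        "page1": "How the Google search engine works",
        "page2": "Google indexes the full text of the web",
    }
    index = build_index(docs)
    print(index["google"])  # word positions of "google" in each page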
Google’s Query Processor
The query processor
has several parts, including the user interface (the search box), the
engine that evaluates queries and matches them to relevant documents, and
the results formatter. PageRank is Google’s system for ranking web pages. A
page with a higher PageRank is deemed more important and is more likely to
be listed above a page with a lower PageRank. Google considers over a
hundred factors in computing a PageRank and determining which documents are
most relevant to a query, including the popularity of the page, the position
and size of the search terms within the page, and the proximity of the
search terms to one another on the page. Google also applies
machine-learning techniques to improve its performance automatically by
learning relationships and associations within the stored data. For example,
the spelling-correcting system uses such techniques to figure out likely
alternative spellings. Google closely guards the formulas it uses to
calculate relevance; they’re tweaked to improve quality and performance, and
to outwit the latest devious techniques used by spammers. Indexing the full
text of the web allows Google to go beyond simply matching single search
terms. Google gives more priority to pages that have search terms near each
other and in the same order as the query. Google can also match multi-word
phrases and sentences. Since Google indexes HTML code in addition to the
text on the page, users can restrict searches on the basis of where query
words appear: in the title, in the URL, in the body, and in links to the
page, options offered by Google’s Advanced Search Form and Using Search
Operators (Advanced Operators) (Copyright © 2003 Google Inc.).
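Since the real formulas are secret, the proximity-and-order idea can only be sketched. The function below assumes the positional index from the earlier indexer sketch and scores a document higher when consecutive query terms appear close together and in query order; the weighting is invented for illustration.

    # Proximity-aware scoring over a positional index (term -> {doc -> [positions]}).
    BIG_GAP = 1000  # penalty when two terms never appear in query order

    def proximity_score(index, query_terms):
        if not query_terms:
            return {}
        # Only documents containing every query term are candidates.
        candidates = set.intersection(
            *(set(index.get(term, {})) for term in query_terms)
        )
        scores = {}
        for doc in candidates:
            total_gap = 0
            for left, right in zip(query_terms, query_terms[1:]):
                # Smallest in-order distance between consecutive query terms;
                # adjacent words in query order give a gap of 1.
                total_gap += min(
                    (p2 - p1
                     for p1 in index[left][doc]
                     for p2 in index[right][doc]
                     if p2 > p1),
                    default=BIG_GAP,
                )
            scores[doc] = 1.0 / (1 + total_gap)  # smaller gaps -> higher score
        return scores

With the toy index built earlier, proximity_score(index, ["google", "search"]) returns a score only for page1, where the two terms appear adjacent and in query order.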
And all of this is done in less than a second, 300 million times a day,
generating over $20 billion a year for Google, so they really do earn a lot
of money.
Eric Jeffrey Arriola
200911820