Saturday 30 May 2015

Chapter 1. Searching Google

1.1 Google's front page is deceptively simple: a search form and a couple of buttons. Yet that basic interface—so alluring in its simplicity—belies the power of the Google engine underneath and the wealth of information at its disposal. And if you use Google's search syntax to its fullest, the Web is your research oyster.

But first you need to understand what the Google index isn't. 

1.2 What Google Isn't

The Internet is not a library. The library metaphor presupposes so many things (a central source for resource information, a paid staff dutifully indexing new material as it comes in, a well-understood and rigorously adhered-to ontology) that trying to think of the Internet as a library can be misleading.

Let's take a moment to dispel some of these myths right up front.

  • Google's index is a snapshot of all that there is online. No search engine, not even Google, knows everything. There's simply too much and it's all flowing too fast to keep up. Then there's the content Google notices but chooses not to index at all: movies, audio, Flash animations, and innumerable specialty data formats.
  • Everything on the Web is credible. It's not. There are things on the Internet that are biased, distorted, or just plain wrong—whether intentional or not. Visit the Urban Legends Reference Pages (http://www.snopes.com/) for a taste of the kinds of urban legends and other misinformation making the rounds of the Internet.
  • Content filtering will protect you from offensive material. While Google's optional content filtering is good, it's certainly not perfect. You may well come across an offending item among your search results.
  • Google's index is a static snapshot of the Web. It simply cannot be so. The index, as with the Web, is always in flux. A perpetual stream of spiders deliver new-found pages, note changes, and inform of pages now gone. And the Google methodology itself changes as its designers and maintainers learn. Don't get into a rut of searching a particular way; to do so is to deprive yourself of the benefit of Google's evolution. 

1.3 What Google Is

The way most people use an Internet search engine is to drop in a couple of keywords and see what turns up. While in certain domains that can yield some decent results, it's becoming less and less effective as the Internet gets larger and larger.

Google provides some special syntaxes to help guide its engine in understanding what you're looking for. This section of the book takes a detailed look at Google's syntax and how best to use it. Briefly: 

Within the page
Google supports syntaxes that allow you to restrict your search to certain components of a page, such as the title or the URL.
Kinds of pages
Google allows you to restrict your search to certain kinds of pages, such as sites from the educational (EDU) domain or pages that were indexed within a particular period of time.
Kinds of content
With Google, you can find a variety of file types; for example, Microsoft Word documents, Excel spreadsheets, and PDF files. You can even find specialty web pages the likes of XML, SHTML, or RSS.
Special collections
Google has several different search properties, but some of them aren't as removed from the web index as you might think. You may be aware of Google's index of news stories and images, but did you know about Google's university searches? Or how about the special searches that allow you to restrict your searches by topic, to BSD, Linux, Apple, Microsoft, or the U.S. government? 

These special syntaxes are not mutually exclusive. On the contrary, it's in the combination that the true magic of Google lies. Search for certain kinds of pages in special collections or different page elements on different types of pages.
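
As a rough illustration of combining syntaxes, the sketch below joins plain keywords with operator:value restrictions into a single query string. The helper name combine and its keyword-argument interface are hypothetical, invented here for illustration; the operators themselves (site:, filetype:, intitle:) are the ones described later in this chapter.

```python
def combine(keywords, **syntaxes):
    """Join plain keywords with operator:value restrictions into one query string.

    Hypothetical helper for illustration only; it just assembles text and
    does not contact Google.
    """
    parts = list(keywords)
    for operator, value in syntaxes.items():
        # Quote multi-word values so the whole phrase stays attached to the operator.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{operator}:{value}")
    return " ".join(parts)


print(combine(["homeschooling"], site="edu", filetype="pdf"))
# homeschooling site:edu filetype:pdf
print(combine(["economics"], intitle="money supply"))
# economics intitle:"money supply"
```

The resulting string can be pasted into the ordinary search box; whether a given combination returns useful results is something to find out by experimenting.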

If you get one thing out of this book, get this: the possibilities are (almost) endless. This book can teach you techniques, but if you just learn them by rote and then never apply them, they won't do you any good. Experiment. Play. Keep your search requirements in mind and try to bend the resources provided in this book to your needs—build a toolbox of search techniques that works specifically for you. 

1.4 Google Basics

Generally speaking, there are two types of search engines on the Internet. The first is called the searchable subject index. This kind of search engine searches only the titles and descriptions of sites, and doesn't search individual pages. Yahoo! is a searchable subject index. Then there's the full-text search engine, which uses computerized "spiders" to index millions, sometimes billions, of pages. These pages can be searched by title or content, allowing for much narrower searches than a searchable subject index. Google is a full-text search engine.

Whenever you search for more than one keyword at a time, a search engine has a default method of how to handle those keywords. Will the engine search for both keywords or for either keyword? The answer is called a Boolean default; search engines can default to Boolean AND (it'll search for both keywords) or Boolean OR (it'll search for either keyword). Of course, even if a search engine defaults to searching for both keywords (AND) you can usually give it a special command to instruct it to search for either keyword (OR). But the engine has to know what to do if you don't give it instructions.
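
To make the difference between the two defaults concrete, here is a toy sketch that applies AND and OR matching to a handful of made-up page texts. It is not how any real search engine is implemented; the function name matches and the sample pages are assumptions for illustration.

```python
def matches(document, keywords, default="AND"):
    """Return True if the document satisfies the keywords under the given Boolean default."""
    words = set(document.lower().split())
    hits = [keyword.lower() in words for keyword in keywords]
    # AND requires every keyword to be present; OR requires at least one.
    return all(hits) if default == "AND" else any(hits)


# Made-up page texts standing in for an index.
pages = [
    "honda snowblower for sale",
    "snowmobile trails",
    "gardening tips",
]

and_results = [p for p in pages if matches(p, ["snowblower", "honda"], "AND")]
or_results = [p for p in pages if matches(p, ["snowblower", "snowmobile"], "OR")]

print(and_results)  # ['honda snowblower for sale']
print(or_results)   # ['honda snowblower for sale', 'snowmobile trails']
```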

1.4.1 Basic Boolean

Google's Boolean default is AND; that means if you enter query words without modifiers, Google will search for all of them. If you search for:
snowblower Honda "Green Bay"
Google will search for all the words. If you want to specify that either word is acceptable, you put an OR between each item:
snowblower OR snowmobile OR "Green Bay"
If you want to definitely have one term and have one of two or more other terms, you group them with parentheses, like this:
snowblower (snowmobile OR "Green Bay")
This query searches for the word "snowmobile" or phrase "Green Bay" along with the word "snowblower." A stand-in for OR borrowed from the computer programming realm is the | (pipe) character, as in:
snowblower (snowmobile | "Green Bay")
If you want to specify that a query item must not appear in your results, use a - (minus sign or dash).
snowblower snowmobile -"Green Bay"
This will search for pages that contain both the words "snowblower" and "snowmobile," but not the phrase "Green Bay." 
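
The Boolean forms above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_query that takes required terms, alternatives, and excluded terms; it only produces the query text shown in the examples.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are treated as a phrase."""
    return f'"{term}"' if " " in term else term


def build_query(required=(), any_of=(), excluded=()):
    """Build a query string: required terms, an OR group, then excluded terms."""
    parts = [quote(t) for t in required]
    if any_of:
        # Parentheses group the alternatives so OR applies only to them.
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    # A leading minus sign marks a term that must not appear.
    parts.extend("-" + quote(t) for t in excluded)
    return " ".join(parts)


print(build_query(required=["snowblower"], any_of=["snowmobile", "Green Bay"]))
# snowblower (snowmobile OR "Green Bay")
print(build_query(required=["snowblower", "snowmobile"], excluded=["Green Bay"]))
# snowblower snowmobile -"Green Bay"
```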

1.4.2 Simple Searching and Feeling Lucky

The I'm Feeling Lucky™ button is a thing of beauty. Rather than giving you a list of search results from which to choose, you're whisked away to what Google believes is the most relevant page given your search, that is, the first result in the list. Entering washington post and clicking the I'm Feeling Lucky button will take you directly to http://www.washingtonpost.com/. Trying president will land you at http://www.whitehouse.gov/.

1.4.3 Just in Case

Some search engines are "case sensitive"; that is, they search for queries based on how the queries are capitalized. A search for "GEORGE WASHINGTON" on such a search engine would not find "George Washington," "george washington," or any other case combination. Google is not case sensitive. If you search for Three, three, or THREE, you're going to get the same results. 

1.4.4 Other Considerations

There are a couple of other considerations you need to keep in mind when using Google. First, Google does not accept more than 10 query words, special syntax included. If you try to use more than ten, they'll be summarily ignored. There are, however, workarounds.

Second, Google does not support "stemming," the ability to use an asterisk (or other wildcard) in the place of letters in a query term. For example, moon* in a search engine that supported stemming would find "moonlight," "moonshot," "moonshadow," etc. Google does, however, support an asterisk as a full word wildcard. Searching for "three * mice" in Google would find "three blind mice," "three blue mice," "three red mice," and so forth.
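
To see what a full-word wildcard means in practice, the sketch below emulates it locally with a regular expression: each * stands for exactly one whole word. This is only an illustration run against sample strings, not a description of how Google evaluates the wildcard; the function name phrase_to_regex is hypothetical.

```python
import re


def phrase_to_regex(phrase):
    """Turn a phrase with * as a full-word wildcard into a compiled regex."""
    # Each * becomes "one run of word characters"; other words match literally.
    pieces = [r"\w+" if word == "*" else re.escape(word) for word in phrase.split()]
    # IGNORECASE mirrors the case-insensitive matching described earlier.
    return re.compile(r"\b" + r"\s+".join(pieces) + r"\b", re.IGNORECASE)


pattern = phrase_to_regex("three * mice")

print(bool(pattern.search("Three blind mice")))       # True
print(bool(pattern.search("three mice")))             # False: the wildcard needs a word
print(bool(pattern.search("three very blind mice")))  # False: * stands for one word only
```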

On the whole, basic search syntax along with forethought in keyword choice will get you pretty far. Add to that Google's rich special syntaxes, described in the next section, and you've one powerful query language at your disposal. 

1.5 The Special Syntaxes

In addition to the basic AND, OR, and quoted strings, Google offers some rather extensive special syntaxes for honing your searches.

Google being a full-text search engine, it indexes entire web pages instead of just titles and descriptions. Additional commands, called special syntaxes, let Google users search specific parts of web pages or specific types of information. This comes in handy when you're dealing with 2 billion web pages and need every opportunity to narrow your search results. Specifying that your query words must appear only in the title or URL of a returned web page is a great way to have your results get very specific without making your keywords themselves too specific.

intitle:
intitle: restricts your search to the titles of web pages. The variation, allintitle: finds pages wherein all the words specified make up the title of the web page. It's probably best to avoid the allintitle: variation, because it doesn't mix well with some of the other syntaxes.
intitle:"george bush"
allintitle:"money supply" economics
inurl:
inurl: restricts your search to the URLs of web pages. This syntax tends to work well for finding search and help pages, because they tend to be rather regular in composition. An allinurl: variation finds all the words listed in a URL but doesn't mix well with some other special syntaxes.
inurl:help
allinurl:search help
intext:
intext: searches only body text (i.e., ignores link text, URLs, and titles). There's an allintext: variation, but again, this doesn't play well with others. While its uses are limited, it's perfect for finding query words that might be too common in URLs or link titles.
intext:"yahoo.com"
intext:html
inanchor:
inanchor: searches for text in a page's link anchors. A link anchor is the descriptive text of a link. For example, the link anchor in the HTML code <a href="http://www.oreilly.com">O'Reilly and Associates</a> is "O'Reilly and Associates."
inanchor:"tom peters"
site:
site: allows you to narrow your search by either a site or a top-level domain. AltaVista, for example, has two syntaxes for this function (host: and domain:), but Google has only the one.
site:loc.gov
site:thomas.loc.gov
site:edu
site:nc.us
link:
link: returns a list of pages linking to the specified URL. Enter link:www.google.com and you'll be returned a list of pages that link to Google. Don't worry about including the http:// bit; you don't need it, and, indeed, Google appears to ignore it even if you do put it in. link: works just as well with "deep" URLs—http://www.raelity.org/apps/blosxom/ for instance—as with top-level URLs such as raelity.org.
cache:
cache: finds a copy of the page that Google indexed even if that page is no longer available at its original URL or has since changed its content completely. This is particularly useful for pages that change often.
If Google returns a result that appears to have little to do with your query, you're almost sure to find what you're looking for in the latest cached version of the page at Google.
cache:www.yahoo.com
daterange:
daterange: limits your search to a particular date or range of dates that a page was indexed. It's important to note that the search is not limited to when a page was created, but when it was indexed by Google. So a page created on February 2 and not indexed by Google until April 11 could be found with a daterange: search on April 11. Remember also that Google reindexes pages. Whether the date range changes depends on whether the page content changed. For example, Google indexes a page on June 1. Google reindexes the page on August 13, but the page content hasn't changed. The date for the purpose of searching with daterange: is still June 1.
Note that daterange: works with Julian dates, not Gregorian dates (the calendar we use every day). There are Gregorian/Julian converters online, but if you want to search Google without all that nonsense, use the FaganFinder Google interface (http://www.faganfinder.com/engines/google.shtml), which offers daterange: searching via a Gregorian date pull-down menu. Some of the hacks deal with daterange: searching without headaches, so you'll see this popping up again and again in the book.
"George Bush" daterange:2452389-2452389
neurosurgery daterange:2452389-2452389
filetype:
filetype: searches the suffixes or filename extensions. These are usually, but not necessarily, different file types. I like to make this distinction, because searching for filetype:htm and filetype:html will give you different result counts, even though they're the same file type. You can even search for different page generators, such as ASP, PHP, CGI, and so forth—presuming the site isn't hiding them behind redirection and proxying. Google indexes several different Microsoft formats, including: PowerPoint (PPT), Excel (XLS), and Word (DOC).
homeschooling filetype:pdf
"leading economic indicators" filetype:ppt
related:
related:, as you might expect, finds pages that are related to the specified page. Not all pages are related to other pages. This is a good way to find categories of pages; a search for related:google.com would return a variety of search engines, including HotBot, Yahoo!, and Northern Light.
related:www.yahoo.com
related:www.cnn.com
info:
info: provides a page of links to more information about a specified URL. Information includes a link to the URL's cache, a list of pages that link to that URL, pages that are related to that URL, and pages that contain that URL. Note that this information is dependent on whether Google has indexed that URL or not. If Google hasn't indexed that URL, information will obviously be more limited.
info:www.oreilly.com
info:www.nytimes.com/technology
phonebook:
phonebook:, as you might expect, looks up phone numbers. 
phonebook:John Doe CA
phonebook:(510) 555-1212
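
Returning to the daterange: syntax described above: the numbers it expects can be computed without an online converter. The sketch below assumes the "Julian date" in question is the integer Julian day number, which is consistent with the seven-digit values in the examples; the function names julian_day and daterange are hypothetical. Under this conversion, 2452389 corresponds to April 24, 2002.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (date.toordinal())
# and the Julian day number of the same calendar date.
JULIAN_OFFSET = 1721425


def julian_day(day):
    """Return the Julian day number for a Gregorian calendar date."""
    return day.toordinal() + JULIAN_OFFSET


def daterange(start, end=None):
    """Format a daterange: restriction; a single date covers just that day."""
    end = end or start
    return f"daterange:{julian_day(start)}-{julian_day(end)}"


print(julian_day(date(2002, 4, 24)))  # 2452389
print(daterange(date(2002, 4, 24)))   # daterange:2452389-2452389
```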
As with anything else, the more you use Google's special syntaxes, the more natural they'll become to you. And Google is constantly adding more, much to the delight of regular webcombers.

If, however, you want something more structured and visual than a single query line, Google's Advanced Search should fit the bill.

1.6 Advanced Search

The Google Advanced Search goes well beyond the capabilities of the default simple search, providing a powerful fill-in form for date searching, filtering, and more.

Google's default simple search allows you to do quite a bit, but not all. The Google Advanced Search (http://www.google.com/advanced_search?hl=en) page provides more options such as date search and filtering, with "fill in the blank" searching options for those who don't take naturally to memorizing special syntaxes.

Most of the options presented on this page are self-explanatory, but we'll take a quick look at the kinds of searches that you really can't do with any ease using the simple search's single text-field interface. 

1.6.1 Query Word Input

Because Google uses Boolean AND by default, it's sometimes hard to logically build out the nuances of just the query you're aiming for. Using the text boxes at the top of the Advanced Search page, you can specify words that must appear, exact phrases, lists of words, at least one of which must appear, and words to be excluded.

1.6.2 Language

Using the Language pull-down menu, you can specify what language all returned pages must be in, from Arabic to Turkish.

1.6.3 Filtering

Google's Advanced Search further gives you the option to filter your results using SafeSearch. SafeSearch filters only explicit sexual content (as opposed to some filtering systems that filter pornography, hate material, gambling information, etc.). Please remember that machine filtering isn't 100% perfect.

1.6.4 File Format

The file format option lets you include or exclude several different Microsoft file formats, including Word and Excel. There are a couple of Adobe formats (most notably PDF) and Rich Text Format as options here too. This is where the Advanced Search is at its most limited; there are literally dozens of file formats that Google can search for, and this set of options represents only a small subset.

1.6.5 Date

Date allows you to specify search results updated in the last three months, six months, or year. This date search is much more limited than the daterange: syntax, which can give you results as narrow as one day, but Google stands behind the results generated using the date option on the Advanced Search, while not officially supporting the use of the daterange: syntax.

The rest of the page provides individual search forms for other Google properties, including news search, page-specific search, and links to some of Google's topic-specific searches. The news search and other topic-specific searches work independently of the main advanced search form at the top of the page.

The advanced search page is handy when you need to use its unique features or you need some help putting a complicated query together. Its "fill in the blank" interface will come in handy for the beginning searcher or someone who wants to get an advanced search exactly right. That said, bear in mind it is limiting in other ways; it's difficult to use mixed syntaxes or build a single syntax search using OR. For example, there's no way to search for (site:edu OR site:org) using the Advanced Search.
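
A query such as (site:edu OR site:org) is easy to write by hand in the simple search box, and it can also be built and URL-encoded in a few lines. The sketch below is illustrative only: the results address http://www.google.com/search with a q parameter is an assumption not stated in this chapter, the hl=en parameter mirrors the Advanced Search URL quoted above, and the function name search_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the results endpoint and its "q" parameter; hl=en mirrors
# the Advanced Search URL quoted earlier in this section.
SEARCH_URL = "http://www.google.com/search"


def search_url(query, language="en"):
    """Return a search URL with the query safely percent-encoded."""
    return SEARCH_URL + "?" + urlencode({"q": query, "hl": language})


# Build an OR group of site: restrictions from a list of domains.
sites = ["edu", "org"]
site_group = "(" + " OR ".join(f"site:{s}" for s in sites) + ")"
url = search_url(f"homeschooling {site_group}")

print(site_group)  # (site:edu OR site:org)
print(url)
# http://www.google.com/search?q=homeschooling+%28site%3Aedu+OR+site%3Aorg%29&hl=en
```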

Of course, there's another way you can alter the search results that Google gives you, and it doesn't involve the basic search input or the advanced search page. It's the preferences page. 
