Review: Implementing the Google Search Appliance in an Intranet environment

Our corporate intranet is a non-framed environment with both Lotus Domino and IIS (.NET and classic ASP) applications and content. We have between 300,000 and 500,000 pages of web content and documents across more than 1,200 "sites" on approximately 30 unique domains. Our previous intranet search engine, Inktomi's UltraSeek Server 3.0 (purchased in 1998), was beginning to show its age. The Inktomi product did not handle attachments well (DOC, PPT, PDF, etc.), would not crawl our secured sites, and was no longer supported by the vendor. We did a cursory review of the search vendors and were immediately attracted to Google's 30-day trial offer for its Google Search Appliance (GSA). After we signed a standard agreement, Google shipped us a shiny new yellow unit that we could test for 30 days before returning or purchasing it.

Product info

The GSA is a "black box" 1U standard rack-mountable server. By "black box" I mean that Google gives you a web interface to administer the device but does not want you to access the operating system (a heavily Google-customized version of Linux). In fact, the license agreement stipulates that you will not tamper with the hardware or OS of the appliance in any way. The device has no need for a keyboard, mouse or video; all you need for normal operation is a network cable and standard power input.

The GSA comes in different flavors to fit different needs, varying by the size of the hardware and, correspondingly, the size of the license. (Licensing is based on the number of URLs crawled by the appliance.) There are three hardware configurations: the GB-1001, GB-5005, and GB-8008. These break down as follows:

  • GB-1001: 150K documents for $28K, 300K documents for $50K
  • GB-5005: 1.5M documents for $230K
  • GB-8008: 4M documents for $450K
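To compare the tiers, it can help to work out the price per licensed document. The following is a minimal Python sketch that uses only the figures listed above; the function names are my own, and the tier-selection rule (pick the first listed tier whose document limit covers your page count) is an assumption for illustration, not Google's sizing guidance.

```python
# License tiers as listed above: (model, licensed documents, price in USD).
TIERS = [
    ("GB-1001", 150_000, 28_000),
    ("GB-1001", 300_000, 50_000),
    ("GB-5005", 1_500_000, 230_000),
    ("GB-8008", 4_000_000, 450_000),
]


def cost_per_document(documents, price):
    """Price in USD divided by the number of licensed documents."""
    return price / documents


def smallest_tier_for(page_count):
    """Return the first listed tier whose limit covers page_count, or None.

    Assumption: the tiers are listed from smallest to largest.
    """
    for model, documents, price in TIERS:
        if documents >= page_count:
            return model, documents, price
    return None


if __name__ == "__main__":
    for model, documents, price in TIERS:
        per_doc = cost_per_document(documents, price)
        print(f"{model} ({documents:,} docs): ${per_doc:.3f} per document")
    # The two ends of our own estimated page count.
    print(smallest_tier_for(300_000))
    print(smallest_tier_for(500_000))
```

Going by these listed limits alone, an intranet at the top of our 300,000 to 500,000 page estimate would not fit within the 300K license, so it is worth knowing your real URL count before choosing a tier.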

Why Google?

As advertised, the GSA met all of our needs: it could index the wide variety of file types in our environment, it could access secured content, it had a documented API, and so on. The Google brand power was another big selling factor. When we told our users that they were going to get a Google-based search engine, they knew their days of troubled searching were over. Lastly, the 30-day trial experience we had with the GSA sealed the deal. The appliance is the easiest enterprise solution I've ever had to install, configure and maintain. We were literally up and running within an hour of opening the shipping box.

Installation

The appliance has two network ports on the back panel: one for normal operation and the other used exclusively for network configuration. To configure the network settings, we connected a laptop to the appliance via the special orange Ethernet cable that is included (some of its pin-outs are non-standard). The installation process was about as easy as one can imagine for a "black box."

First we plugged in the normal operation network cable and then the power. The power plug on the appliance IS the power switch: plug it in to turn it on and unplug it to turn it off. After plugging it in, we waited about 5 minutes for the appliance to play a tune, which is the signal to continue. Next, we hooked up our laptop (already set to DHCP mode) to the appliance and powered it up. After logging in to our laptop and making sure we had the correct IP address assigned by the appliance's built-in DHCP server, we were ready to configure the network settings. Total elapsed time (excluding rack mounting): 10 minutes.

Configuration

Network configuration, like normal administration, is done entirely through a browser and is a simple 5-step process. The first screens ask you for basic network information: IP address, subnet mask, default gateway, and DNS. Subsequent screens collect the SMTP server, the "From" address for GSA notification messages, the time zone, NTP (time) servers and the admin account name/password. The last step is to test a few URLs that you will be crawling to make sure you've done the setup correctly. After a final settings review screen, configuration is complete and you can unplug your laptop and get to the good part: crawling. Total elapsed time: 10 minutes.

Crawling the site(s)

Using the URL provided, all administration of the GSA is done remotely. After logging in with the ID/password we provided in the previous step, we were presented with the Administration console. We created a new collection to hold our index, put in the "Start crawling from" URL, copied that same URL into the "Follow and Crawl only URLs with the Following patterns" box and we were done. We saved our settings and then clicked the "Start crawling" button. We then went over to the "Crawl status" screen and watched the "Crawled URLs" counter increase. Google advertises that the appliance can crawl about 4,000 URLs in about 15 minutes. We found that the crawl time increased significantly when documents (Word, PDF, Excel, etc.) were linked from those URLs.
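For planning purposes, the advertised rate can be turned into a rough estimate of how long a full crawl might take. The sketch below is back-of-the-envelope arithmetic only; the `slowdown` multiplier is a hypothetical knob of my own to account for linked documents, since we did not measure an exact factor.

```python
# Advertised rate quoted above: about 4,000 URLs in about 15 minutes.
ADVERTISED_URLS = 4_000
ADVERTISED_MINUTES = 15


def estimated_crawl_hours(url_count, slowdown=1.0):
    """Estimate crawl duration in hours at the advertised rate.

    slowdown is a hypothetical multiplier (1.0 means the advertised rate)
    for sites with many linked Word, PDF or Excel documents.
    """
    minutes = url_count / ADVERTISED_URLS * ADVERTISED_MINUTES * slowdown
    return minutes / 60


if __name__ == "__main__":
    for pages in (300_000, 500_000):
        hours = estimated_crawl_hours(pages)
        print(f"{pages:,} URLs: about {hours:.2f} hours at the advertised rate")
```

At the advertised rate, an intranet of our size works out to somewhere between roughly 19 and 31 hours for a complete crawl, before allowing for any slowdown from attachments.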

After the crawl is done, the collection is automatically indexed and then checked against the Serving Prerequisites (any criteria you wish to use to determine whether to move an indexed collection to production). The collection is then either moved to Production (and consequently becomes searchable) or moved to Staging. The Staging area lets you validate new crawls before letting users search against them.

Crawling configuration

After your first crawl you may find the need to go back and tweak the crawling parameters. Google gives you a good amount of control over how sites are crawled, how frequently, how many threads are used, etc. For sites with security, the GSA supports Basic Authentication, and an additional security module is available that supports Forms Authentication. The most challenging configuration task for us was determining the right combination of URL patterns to exclude from the search. If you are a Domino shop looking to use the GSA, you may need to spend some time getting the crawler configuration just right to support the sometimes convoluted Domino query string parameters.
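To illustrate the kind of exclusion logic involved, here is a minimal Python sketch that filters Domino-style URLs before they would be crawled. It uses Python regular expressions rather than the appliance's own pattern syntax, which is not shown in this review. The rules and the example.com URLs are hypothetical; your own Domino applications will need their own list.

```python
import re

# Hypothetical exclusion rules, written as Python regular expressions.
# They target Domino URL arguments that tend to produce many near-duplicate
# views of the same content (expanded/collapsed views, paged views, edit forms).
EXCLUDE_PATTERNS = [
    re.compile(r"[?&](ExpandView|CollapseView)\b", re.IGNORECASE),
    re.compile(r"[?&](Expand|Collapse)=", re.IGNORECASE),
    re.compile(r"[?&]Start=\d+", re.IGNORECASE),
    re.compile(r"\?(EditDocument|OpenForm)\b", re.IGNORECASE),
]


def should_crawl(url):
    """Return True when no exclusion rule matches the URL."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


# Hypothetical sample URLs for a quick check of the rules.
SAMPLE_URLS = [
    "http://intranet.example.com/hr.nsf/policies/vacation?OpenDocument",
    "http://intranet.example.com/hr.nsf/ByDate?OpenView&Start=31",
    "http://intranet.example.com/hr.nsf/ByDate?OpenView&ExpandView",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        verdict = "crawl" if should_crawl(url) else "skip"
        print(f"{verdict}: {url}")
```

Running a list of known URLs through a filter like this before changing the crawler settings is a cheap way to check that a new rule does not accidentally exclude real content.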

After we got the crawl parameters tuned and the first complete crawl done, we did some testing to see if the crawler had grabbed all the content. Browsing our site and testing with some strings buried deep inside the taxonomy, we always found the GSA had crawled them accurately. We also did some testing with strings inside PDF documents, PowerPoint presentations and the like. When we did come across something that hadn't been crawled, a careful analysis showed that we needed to do some more tweaking of the crawl settings.

Other notable features

Google also gives you a KeyMatch tool that allows you to specify which indexed documents should appear at the top of the results page for a given query. These manifest themselves almost identically to the Sponsored Links at the top of the results page of the Google we all use. A Synonym tool allows you to specify alternate words or phrases for search queries. For example, if someone searches for WCM, you can suggest "Web Content Management" at the top of the results page.
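Conceptually, both tools are lookups keyed on the query. The Python sketch below shows that idea only and does not reflect how the appliance implements it. The table contents and the example.com URL are hypothetical, and matching on the whole, case-insensitive query is a simplifying assumption.

```python
# Hypothetical KeyMatch table: query -> promoted (title, URL) pairs.
KEYMATCHES = {
    "wcm": [("Web Content Management home", "http://intranet.example.com/wcm/")],
}

# Hypothetical synonym table: query -> suggested alternate phrase.
SYNONYMS = {
    "wcm": "Web Content Management",
}


def decorate_results(query, organic_results):
    """Attach promoted links and a suggested phrase to ordinary results.

    Assumption: the whole query is matched, ignoring case and
    surrounding whitespace.
    """
    key = query.strip().lower()
    return {
        "suggestion": SYNONYMS.get(key),
        "promoted": KEYMATCHES.get(key, []),
        "results": organic_results,
    }


if __name__ == "__main__":
    page = decorate_results("WCM", ["result 1", "result 2"])
    print("You could also try:", page["suggestion"])
    for title, url in page["promoted"]:
        print("KeyMatch:", title, url)
```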

An output format feature lets you control the presentation of the search results via an XSLT stylesheet. You can use this to change the fonts, colors, logo, header, etc. of the results page. We were able to easily remove the "Cached" feature from the results page with some XSLT modifications.

The Reporting tool lets you run reports on search queries over various time ranges. It will show you the number of searches per day, per hour, the top 100 keywords and top 100 queries for the time period specified.
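For readers who want to see what these aggregates amount to, here is a minimal Python sketch that computes the same kinds of figures from a search log. The log format and its entries are hypothetical; the appliance produces its reports through the admin console, and this is not its export format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical search log entries: (timestamp, query as typed).
SEARCH_LOG = [
    ("2004-03-01 09:15", "vacation policy"),
    ("2004-03-01 09:40", "expense report"),
    ("2004-03-01 14:05", "vacation policy"),
    ("2004-03-02 10:30", "Vacation Policy"),
]


def summarize(log, top_n=100):
    """Count searches per day and per hour, plus the top queries and keywords."""
    per_day = Counter()
    per_hour = Counter()
    queries = Counter()
    keywords = Counter()
    for stamp, query in log:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        per_day[when.date().isoformat()] += 1
        per_hour[when.hour] += 1
        normalized = query.lower()  # treat queries as case-insensitive
        queries[normalized] += 1
        keywords.update(normalized.split())
    return {
        "per_day": dict(per_day),
        "per_hour": dict(per_hour),
        "top_queries": queries.most_common(top_n),
        "top_keywords": keywords.most_common(top_n),
    }


if __name__ == "__main__":
    report = summarize(SEARCH_LOG)
    print("Searches per day:", report["per_day"])
    print("Top queries:", report["top_queries"][:3])
```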

Downsides

The GSA is not for organizations looking to index their shared network drives, as the appliance has no facility for crawling file systems. This is really too bad, as many companies struggle with the huge quantities of unstructured content stored on their networks. Of course, there is a plethora of other products out there for exactly this issue.

Direct access to databases (e.g. SQL, Oracle, etc.) is also off-limits for the GSA, as is any kind of integration with content or document management systems.

Conclusion

The Google Search Appliance (GSA) is an excellent search product for HTTP-accessible content. It gives great control over administrative features such as crawler configuration and results serving, and it offers sufficient reporting capabilities as well. Those looking for a solution that integrates directly with a content/document management system or databases, or that indexes network drives, should look to another product. However, if you have an intranet or Internet site with plenty of HTML-based content, the GSA may be just what you need.

Bryan Mjaanes is the creator/editor of Intranet101.com, a community-based forum for Intranet professionals.
