Search Industry Blog - When Will We Start Finding? 



About Brian Bingley

Brian Bingley authored and submitted this article.


Finding that Elusive Internet Search Tool

By Brian Bingley
The Internet, and networks in general, revolve around the passing of information between users distributed over distance. Looking at the various services it provides in a historical order can give us a clearer idea of what the Internet is and what it can do.

In the beginning... there were Archie, FTP (File Transfer Protocol) and Gopher. Gopher and Archie already contained most of the components of current search engines. They had spiders crawling the network looking for content, which was stored in databases and/or topic directories. Sites were ranked by a computer-generated estimate of relevance to the query. Search protocols, including the search command syntax, were established from the beginning. As Archie was to FTP archives, Veronica was to Gopherspace: a search utility that helps find information on Gopher servers. (See Common Questions and Answers about Veronica, a title search and retrieval system for use with the Internet Gopher.) Check the Original Internet Hunt and THE ANSWERS for an indication of what the early Internet could deliver in the right hands.

Then came... Graphical User Interfaces (GUIs), in the form of Mosaic in 1993, and the World Wide Web (WWW). Existing search features were built into new search engines such as Lycos, AltaVista and HotBot (see A brief history of the Lycos and HotBot search engines and A brief history of the AltaVista search engine). AltaVista's popularity stemmed from its embrace of Boolean searches enhanced with 'case sensitivity', 'phrase searching' and a 'proximity search capability' (the NEAR operator), all of which survived until Yahoo's recent takeover of AltaVista.
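A proximity operator such as NEAR can be illustrated with a small sketch. This is not AltaVista's implementation; the function name near, the window parameter and the sample sentence are assumptions made purely for illustration. The idea is simply to record the word positions of each term and test whether any pair of positions falls within the window.

```python
import re


def near(text, term_a, term_b, window=10):
    """Return True if term_a and term_b occur within `window` words of each other.

    Hypothetical helper for illustration; `window` is an assumed parameter.
    """
    # Split the text into lowercase words so matching is case-insensitive.
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    # A match exists if any occurrence of one term is close enough to the other.
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)


doc = "Archie indexed FTP archives long before the web had search engines"
print(near(doc, "archie", "archives", window=3))
print(near(doc, "archie", "engines", window=3))
```

A plain Boolean AND would match both queries here; the proximity test only accepts the pair of terms that sit close together.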

Ten little Indians... HotBot, based on Inktomi search results, offered features such as field searching, limiting by date and searching for particular file types. It was, briefly, the search engine of choice for many, but it never achieved the popularity enjoyed by AltaVista and lost direction after substituting DirectHit content for Inktomi's. Even before the Internet bubble burst in 2001, great search tools closed or changed. You can see some of them at Searching Graveyard, where they are organized in chronological order with some of their logos.

The Swiss Army knife of search engines; or why we are googly-eyed about Google... In a departure from the Boolean search technologies of the early nineties, Google ranks hits by their linkages (in imitation of the famed ISI Citation Indexes for academic literature) and authority rather than by 'weightings' based on the number of occurrences of keywords in the text. Google detects phrase matches even when quotes are not used in the basic search mode, and it usually ranks documents with matching phrases higher. (See Review of Google 5 June 2004; Google Advanced Search operators and The Google ~Guide Site.)
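Ranking by linkages rather than keyword counts can be sketched as a simplified PageRank-style iteration. This is a toy illustration and not Google's actual implementation; the function name link_rank, the damping factor of 0.85, the iteration count and the four-page link graph are all assumptions. Each page repeatedly passes a share of its score to the pages it links to, so a page cited by many others ends up with a high score, much as a heavily cited paper does in a citation index.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links (simplified PageRank-style power iteration).

    `links` maps each page to the list of pages it links to.
    All names and default values here are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_rank(toy_web)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy graph, page "c" is linked from three other pages and comes out on top, regardless of how often any keyword appears in its text.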

Google has its limitations...: There is no nesting, no truncation, and it does not support full Boolean logic; it only indexes the first 101 KB of a Web page and about 120 KB of PDFs; the number of keywords you can search on is limited to 10 (now 32, as of January 2005), but you can override this limitation by putting a plus sign (+) in front of any of the words when using them in a search phrase, or you can use the wildcard symbol (*) and actually search for more than 10 (32) keywords at a time, because the (*) is not counted as a word.

...and special features: It is currently indexing the abstract records for all online technical documents and standards by the Institute of Electrical and Electronics Engineers (IEEE); abstracts are available free, and full-text documents are available to subscribers or for online purchase. Starting a search with "define", "definition", "what is", or "what are" will invoke a Google Glossary lookup. Google will soon provide access to a 2 million record subset of the more than 53 million records in the OCLC Project WorldCat, covering the most popular and widely available books (but see Two Million Open Worldcat Records Hit the Yahoo Database - Infotoday July 18 2004). WebQuotes shows what people are saying about a particular site. Google provides background information on a page if you type the URL in the form info:www.whatever.xxx. (See also Gary Price's Tips for Searching Google and FAQ based on questions in the google.public.support.general newsgroup.)

... but that ain't all. Google very sensibly allows and encourages others to adapt and enhance their software, as indicated by the following examples: Google Ultimate Interface utilizes all advanced search options (e.g. Web search, Image search, News search) and Google's tools (e.g. Glossary, Sets), lets you toggle the Duplicates Filter on or off, use the file format search, and set the number of results per page, and has links for typing non-English letters; Google API Proximity Search (GAPS) lets you look for two words within one, two or three words of each other; Google Hacks by Tara Calishain and Rael Dornfest (book) offers 100 industrial-strength, real-world, tested solutions to practical problems, including Hack 5: Getting Around the 10 Word Limit; Hack 17: Consulting the Phonebook; Hack 32: Google News; Hack 44: Scraping Google Results; Hack 54: NoXML, Another SOAP::Lite Alternative; Hack 79: Measuring Google Mindshare; Hack 87: Google Whacking; Hack 100: Removing Your Materials from Google.

There are other search engine technologies... The clustering search engines Vivisimo, Mooter and SnakeT (SNippet Aggregation for Knowledge ExTraction) show potential but are affected by the usual business manoeuvrings. The clustering meta-search engine Vivisimo no longer harvests data from Google. Different tools cluster using different methods. One of the more common methods is to look for phrases which appear in multiple listings. All pages that contain a certain phrase are listed in one cluster, whose name is that phrase. (See Topic Clustering in Searches.) Kartoo visual search and Maps of the Web use similar technology but present their results in a visual display.
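The shared-phrase method of clustering can be sketched in a few lines. This is a minimal illustration, not the method used by Vivisimo or any other engine named here; the function name cluster_by_phrase, the two-word phrase length, the minimum of two listings per cluster and the sample listings are assumptions. Every two-word phrase in each listing is recorded, and any phrase found in at least two listings becomes a cluster named after that phrase.

```python
import re
from collections import defaultdict


def cluster_by_phrase(snippets, phrase_len=2, min_docs=2):
    """Group result snippets by phrases that appear in several of them.

    `snippets` maps a result id to its snippet text. Names and defaults
    are illustrative assumptions; no stop-word filtering is attempted.
    """
    clusters = defaultdict(set)
    for doc_id, text in snippets.items():
        words = re.findall(r"\w+", text.lower())
        # Slide a window of `phrase_len` words across the snippet.
        for i in range(len(words) - phrase_len + 1):
            phrase = " ".join(words[i:i + phrase_len])
            clusters[phrase].add(doc_id)
    # Keep only phrases shared by enough listings; the phrase names the cluster.
    return {p: sorted(ids) for p, ids in clusters.items() if len(ids) >= min_docs}


results = {
    "r1": "Search engine history from Archie to Google",
    "r2": "A short search engine timeline",
    "r3": "Gopher servers and the Veronica index",
    "r4": "How Veronica index files were built",
}
clusters = cluster_by_phrase(results)
print(clusters)
```

A real clustering engine would also have to discard uninformative phrases and merge overlapping clusters, which this sketch leaves out.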

Natural language searching presents a problem for artificial intelligence due to the complexity, irregularity, and diversity of human language. Ask Jeeves is the best example of a search engine using natural language. Ixquick and Surfwax are also of interest, but Surfwax, like other very successful Internet technologies, has been absorbed into the commercial sector (see SurfWax Enterprise / SurfWax Scholar / SurfWax LawKT (Knowledge Tools)). Applied Semantics' Oingo provided very effective natural language searches in limited domains, but it too was quickly commercialised. Applied Semantics was acquired by Google in April 2003 to drive their ad products. (See Google Buys Applied Semantics.)


Beyond Google... Rumours persist of work being done by Yahoo! and Microsoft (MSN) to supplant Google (see MSN launches revamped search engine and Yahoo! Search has a fresh, new look), and claims are made about social networking search technologies such as those employed by Eurekster, Orkut, Ryze, LinkedIn, delicious, and Furl, but none of these appears to be in a position yet to effect a dramatic shift in web searching. (See accounts of Tim Berners-Lee's Semantic Web for more measured projections of future search technologies.)

But wait. Search engines don't tell us the full story. There are many tried and true websites for searchers. Here is a selection of resources which can be consulted immediately when appropriate:

DIRECTORIES: Keyword searching ensures maximum recall but often finds far too many hits to check easily, and some of those found have limited relevance. The hierarchical subject directories, on the other hand, were usually produced by human indexers and consequently excluded much of the ephemeral, the unreliable and the purely commercial. Whilst these are now under challenge from the clustering search engines (see above), many remain key resources, for example:
Beyond...the Black Stump  which includes Australiana and Search by ISBN (compare the prices of in-print and out-of-print books at 14 online bookstores)
BUBL LINK / 5:15 Catalogue of Selected Internet Resources
Gary Price's List of Lists  (and see his weekly newsletter)
The World Wide Web Virtual Library (WWW-VL), the oldest catalog of the web, started by Tim Berners-Lee, the creator of HTML and the web itself
About - The Human Internet [formerly called the Mining Company] directory/portal neatly organizes thousands of topics, with good news and commentary.
About.com Closed Guide Relocation Directory and Assistance Links designed to help editors relocate their pages and users find the pages that have  moved
INFOMINE scholarly Internet resource collections 
Internet Public Library  - see their Pathfinders
Librarians' Index to the Internet - see LII Theme Collection: The Olympic Games, Librarianship & California and Washington Wine


SPECIALISED SEARCH TOOLS
Amazon.com "Search Inside the Book"  results list authors and titles, "excerpt from" and the hyperlinked title of the book...FAQ
Cached websites
  Gigablast, Wayback Machine, Daypop, IncyWincy, Yuntis (See also Finding Old Web Pages)
MESA - Meta-Email-Search-Agent
PINAKES, A Subject Launchpad
Voice of the Shuttle (University of California, Santa Barbara) one of the few comprehensive research subject lists with a humanities orientation.
SurfWax Enterprise/SurfWax Scholar /SurfWax LawKT (Knowledge Tools) by subscription


SEARCH PORTALS
Fagan Finder - search engines, reference, tools, and more...Biography page...Quotations and Proverbs Search
Pandia Powersearch: All-in-One List of Search Engines


DATABANKS &/OR DIGITAL LIBRARIES
Encyclopedia Britannica: The 1911 Edition   
Jewish Encyclopedia.com
New Advent Catholic Encyclopedia
Nonverbal Dictionary of Gestures, Signs & Body Language Cues 
Official history of Australia in the war of 1914–1918
Guardian Archive (since 1899)
Home Economics Archive: Research, Tradition, History
Old Car Manual Project
Spectator Text Project    Published by Joseph Addison and Richard Steele from 1711 to 1714
Technology in Australia 1788-1988


INTERNET ARCHIVES
Scout Report Archives
The Coombsweb is the world's oldest and most prominent Asian Studies online research facility. Its Web pages are designed for transmission speed, not fancy looks.
Alan Lomax Archive... Audio Archive...Film and Videotape Archive...Paper Archive...Photograph Collection.
BBC World Service Archive  international news, analysis and information in English and 42 other languages  (See also BBC Audio Interviews )


AUSTRALIAN DATABANKS & GUIDES
The AusAnthrop Database On Line
AusStage gateway to Australian performing arts
Australian Cooperative Digitisation Project, 1840-45
Australian Digital Theses Program - CAUL
Historical Australian Acts (none earlier than 1973)
Mining in Australia
Social Health Atlas of Australia
Womens Weekly Index Database
See also NLA's Electronic Australiana, Charles Sturt University Regional Archives  and SLNSW's Aboriginal Australian links

SOUTH AUSTRALIAN DATABANKS & GUIDES
Atlas of South Australia
Ground Truth: a community resource guide to the human & environmental history of South Australia: biogeographical regions, local government areas (LGA), coastal and marine mapping, Aboriginal history
SASS: South Australian Sources for History and Social Science     Brian Condon's 30 year compilation includes relevant theses
South Australian Police Historical Society


CONTACTS
Free Pint Bar
International Rivers Network
Medical Expert Witness Database: Green MedicoLegal Ltd
NGO Global Network
OZLISTS: A list of Australian electronic mailing lists
Philanthropy Australia
Pitsco's Ask an Expert
Yearbook of Experts, Authorities and Spokespersons

KEEPING UP-TO-DATE WITH WEB RESOURCES
Gary Price's ResourceShelf
Freepint newsletter
...Beyond the Black Stump newsletter
BUBL News
BUBL LINK Updates
NSDL Scout Reports
ResearchBuzz.com
Phil Bradley's weblog  Internet searching, web design, search engine developments and anything that will interest librarians!
Library Clips Web 2.0 oriented search blog
Top 100 Alternative Search Engines, March 2007; Feb 2007; Jan 2007
Read/WriteWeb Web 2.0 weblog ranked among Technorati’s Top 50 blogs in the world...web technology news, reviews & analysis
Search Month is a monthly newsletter that recaps stories covered on Search Engine Land over the past month.

Check for others at Google Groups, Yahoo! Groups, OZLISTS and Internet Resources Newsletter: Internet In Print Index