Fluffy Bunny

I just spent a few hours playing with Google Desktop (the product formerly known as Fluffy Bunny), and it is a beautiful thing. There was instant gratification using the tool a minute after I installed it, and as it indexes more, it becomes more useful. How did I ever live without it? Can I even call that living?

This is the Fisher Windows users have been waiting for, as I asked in March:

Google has shown me I can have it all: fast, ranked search with a simple UI and a rich query language. Is it too much to ask for being able to have that kind of search for my personal data the way I can already search the public web?
Apparently, it's not too much; Google has once again put a smile on my face and instantly made me a more productive version of myself. They have raised the bar, leaving a very difficult act to follow for Microsoft, Ask, X1, and all the desktop search tools Michael Wexler mentions; my only wish is that my Linux and Mac machines could have Fluffy Bunnies, too!

Rael Dornfest describes it well:

The Google Desktop is your own private little Google server. It sits in the background, slogging through your files and folders, indexing your incoming and outgoing email messages, listening in on your instant messenger chats, and browsing the Web right along with you... In evaluating the Google Desktop as an interface to finding needles in my personal haystack, one thing sticks in my mind: I stumbled across an old email message I was sure I'd lost.
Danny Sullivan has a great metaphor: "Google Desktop Search makes it easy to painlessly preserve your own archive of what you've seen and for free. It becomes, as Gary Price wished for last week, a TiVO for the web."

There's a wonderful chock-full-of-bits analysis and interview by John Battelle (and more), including a wishlist that got me very enthused:

This provides Google a major new platform to build upon -- a client application that integrates with the web. Can I imagine upgrades to that app that include spiffy new features like -- oh -- a lightweight word processor so you can take notes on your searching, or a calendar? Better yet, can I imagine Google opens this platform up to third party developers, to do what they do best? Yes, I sure can.
I would love to develop on this platform. Google Superwoman Marissa Mayer revealed several nice tidbits in John Battelle's interview, including:
The technical details of this product are stunning. It only uses 8 megs of RAM to run. It's a 400 Kbyte file!

...The distinction between the hard drive and the net is becoming blurred. We want this application to be a sort of photographic memory for your screen...

The default rank is by date. (When we tested, we learned that) people understood the context of "when they did see this"? The results list the last time you accessed any particular document. However you can also sort by relevance. The desktop relevance scheme lacks Pagerank (of course), but it does incorporate the other 150 factors (Google uses on the web) - factors like are the (keywords) together, in bold, related, things like that.

Now please excuse me while I go play with the tool some more. Fluffy Bunny Fisher, I salute you. Marissa Mayer, I salute you. Steve Lawrence, I salute you. I need to buy a gun so I can do a 21-gun salute... time to do a web search. Hmmm. Apparently searching for a warm gun doesn't actually find me a gun, but it sure does find a lot of happiness...

Personal Web

In April 2004 during a description of the (now-part-of-Looksmart) Furl service, John Battelle defined Personal Web:

Furl saves the entire web page you've "furled", not just the URL, which prevents link rot, on the one hand, and creates what I'll call a "PersonalWeb," on the other.

Now, having your own PersonalWeb is a very cool thing. Every page you care about is now saved forever, and is searchable. How I wish I had Furl while I was researching my book for the past year. This application was inconceivable before the cost of storage and bandwidth began to fall toward zero.

But wait...there's more. You can share your PersonalWeb with others. And Mike just added a recommendation engine, so you can see links the service thinks will be interesting to you, based on what you've already Furl'd. Now, let's play this out. Imagine Furl on, oh, Yahoo, for example. Or Google. You now have a massively scaled application where millions of people are creating their own personal versions of the web, and then sharing them with each other, driving massively statistically significant recommendations, and...some pretty damn useful metadata that can be fed into search engine algorithms, resulting in...yup, far better search (and...far better SFO (Search Find Obtain) opportunities).

Speaking of SFO, imagine the business model. (Mike has, trust me.) If you have a system that has stored millions of people's PersonalWebs, webs they have literally voted for by *taking action* and *saving* or even *annotating*, then it's not such a trick to apply some contextual advertising mojo to the whole lot. After all, Web 2.0 is built on the premise that taking action - voting, in effect - can create scaled value (the best known expression of a scaled "voting system" is Page Rank- a link is a vote.)

(Aside #1: That last paragraph reminds me of NudeCybot's post Blogdex: a modern oracle or the new Pagerank? (inspired by Biz Stone's The Wisdom of Blogs): "There is a certain collective intelligence when you have the power of numbers, diversity of opinion, and freedom of both medium and content...  This is strikingly similar to the process of biological evolution whose effectiveness in what I would describe as life itself solving the problem of staying alive is related to population size, phenotypic variation and its underlying genetic variation, and availability of diverse environments with subtle variations.")

In September 2004 John revisited Personal Webs in his description of a new Ask service:

MyJeeves allows you to save results, annotate them, and then manage them in your own personal folders. Those results (and the annotation) are then searchable (as they are with A9)... Once you do, you get unlimited storage of saved results, and... pay attention here... your search history... It's like creating your own web index... the integration of search with perfect copies of what I've seen on the web...

(Aside #2: That article also talks about the imminent release of the (now-part-of-Ask) Tukaroo product integrating web search with my hard drive, which continues to push forward the evolution of Fisher as a product category.)

John's article then paints a compelling picture where a world of connected Personal Webs -- with the derivatives of information in the Personal Web also being thrown into the Information Soup that is the Global Web -- creates a virtuous cycle that keeps generating and refining information:

None of these features [Personal Webs and Fishers being two of them] are big enough to warrant a Google Moment like we had in 2000-2001. However, they all point to an incredibly robust future - and by the way, a future in which personal publishing is very much integrated into search, and vice versa. Just a thought, but once a critical mass of folks are saving searches, search results, annotations, and the like, sure as shit they'll want to share them, publish them, and cite them (and sure as shit, engines will want to crawl em for relevance mojo). Just watch as search, blogging, and RSS start to feed off one another.

Let's put aside copyright issues -- I'm assuming the New York Times, for example, has as much problem with me saving a personal copy of every (copyrighted) page I've read on nytimes.com as the RIAA has with my ripping mp3's from my CD collection (no DRM?? for shame!) -- and stick to the issue of whether my Personal Web is big.  How much work would it be to build a Fisher that indexes my Hard Drive and my Personal slice of the public Web?

Back-of-the-envelope calculation #1: PersonalWeb.

An approximation of the size of my Personal Web is a function of how many pages I've read, how long I've been reading, and what the average page size is.  That is,

Personal Web Size = Avg # Pages/Day x # Days x Avg Page Size

I've been on the Web for almost 12 years. Assume a high estimate of the average number of web pages I've visited per day, like 100.  I don't know what the average web page size is, but it used to be 60k a few years ago. Let's call it 100k now to account for the bigger images that wider broadband adoption allows, knowing full well that, like the average of 100 pages a day I read, the estimate is high.  That puts an upper bound on the size of the Personal Web I've personally seen at 12 x 365.25 x 100 x 100k = 4383 days x 100 pages/day x 100,000 bytes/page = 43,830,000,000 bytes, or less than 44 Gigabytes.
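For anyone who wants to fiddle with the inputs, here is a minimal Python sketch of that arithmetic. The variable names are mine, and it assumes "100k" means 100,000 bytes and a Gigabyte means a billion bytes, as the numbers above do.

```python
# Back-of-the-envelope #1: upper bound on the size of a Personal Web.
# All inputs are the deliberately high estimates from the text.
YEARS_ON_WEB = 12
DAYS_PER_YEAR = 365.25
PAGES_PER_DAY = 100        # high estimate of pages visited per day
BYTES_PER_PAGE = 100_000   # "100k", read as decimal kilobytes (assumption)

days = round(YEARS_ON_WEB * DAYS_PER_YEAR)
personal_web_bytes = days * PAGES_PER_DAY * BYTES_PER_PAGE

print(f"{days} days")
print(f"{personal_web_bytes:,} bytes")
print(f"{personal_web_bytes / 1e9:.2f} Gigabytes")
```

Halve the pages per day or the page size and the bound drops to about 22 Gigabytes, so the conclusion is not sensitive to the exact guesses.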

Wow. I already have more than that space available today on my hard drive (and/or in my 100 Gigabyte email account), so if I had a Personal Proxy that seamlessly stashed a copy of every Web page I've ever seen, I could easily store it today, and then I would just have to focus on how to index and search it.

Back-of-the-envelope calculation #2: PersonalWeb++.

Assume we live in The World Of Tomorrow, where hard drives have many Terabytes of storage.  Even though I'm regularly slurping down huge amounts of content via BitTorrent and eMule -- gotta love decentralized filesharing -- I still have plenty of space (and CPU power) not only to keep on my hard drive a copy of my entire Personal Web (it grows at 3.65 Gigabytes a year), but also to run my own Personal Spider that regularly crawls, on my behalf, the entire contents of any site I've ever visited (including but not limited to all my favorite blogs and RSS feeds), without ever clobbering any of the content it has ever fetched for me, so that nothing ever gets lost in the ether of, say, Archive, where I can never find it again.  How many websites have I visited over the years?  Let's be conservative and say I visit an average of 50 websites a day, year after year.  But how many of those are unique?  Let's conservatively say I discover on average 10 new web sites a day -- and I know that number is high if I average it over decades.  How big is the average website?  Let's be really conservative and say that it's 1000 pages.  Using our earlier assumption of 100k per page, I have 10 new sites/day x 1000 pages/site x 100k bytes/page, or 1 Gigabyte per day on average of new material to add to the store of all information around the periphery of my Personal Web.  Over 12 years (or 4383 days) the upper bound on the amount of storage I would have needed to save my PersonalWeb++ is therefore less than 4.4 Terabytes. Uncompressed.  Which seems small if I have a 100 Terabyte hard drive at my disposal.
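The same estimate as a rough Python sketch, in case you want to plug in your own guesses. The names are mine; the 3.65 Gigabytes a year figure is reproduced assuming a 365-day year, and all units are decimal.

```python
# Back-of-the-envelope #2: upper bound on PersonalWeb++ storage.
PAGES_READ_PER_DAY = 100   # from calculation #1
BYTES_PER_PAGE = 100_000   # "100k", read as decimal kilobytes (assumption)
NEW_SITES_PER_DAY = 10     # high estimate of newly discovered sites
PAGES_PER_SITE = 1_000     # generous guess at average site size
DAYS = 4_383               # 12 years, as in the text

# Growth of the plain Personal Web, assuming a 365-day year.
yearly_growth_bytes = PAGES_READ_PER_DAY * BYTES_PER_PAGE * 365

# Material the Personal Spider would add around the periphery.
spider_bytes_per_day = NEW_SITES_PER_DAY * PAGES_PER_SITE * BYTES_PER_PAGE
spider_total_bytes = spider_bytes_per_day * DAYS

print(f"{yearly_growth_bytes / 1e9:.2f} Gigabytes/year of pages read")
print(f"{spider_bytes_per_day / 1e9:.0f} Gigabyte/day of crawled pages")
print(f"{spider_total_bytes / 1e12:.3f} Terabytes over {DAYS} days")
```

The crawl dominates: the spider fetches in about four days what I read in a whole year.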

So maybe a Fisher that indexes and offers up PersonalWeb++ is a big idea after all.  At least if storage is the metric by which we measure the bigness of an idea. Which gets me wondering...

Back-of-the-envelope calculation #3: How big is the Web today?

Four years ago, BrightPlanet did a deep web study, and determined that "Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web. The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web. The deep Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web. More than an estimated 100,000 deep Web sites presently exist. Sixty of the largest deep Web sites collectively contain about 750 terabytes of information – sufficient by themselves to exceed the size of the surface Web by 40 times."

Fast forward some.  18 months ago a blogger wrote, "Google has indexed over 3 billion. I thought the common wisdom was that there are over 20 billion pages."

How about today? Well, Google says it currently indexes 4.3 billion web pages.  Let's assume for fun that they only have 10% of the actual existing Global Web in their wonderful little paws.  43 billion web pages times an average of 100k per page is 4.3 quadrillion bytes (that's 4300 terabytes, or 4.3 petabytes)... sounds like a lot.
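A quick Python check of that multiplication, using my own variable names and decimal units throughout. It lands on the 4300 terabytes (4.3 petabytes) in the parenthetical.

```python
# Back-of-the-envelope #3: rough size of the Global Web.
INDEXED_PAGES = 4_300_000_000  # pages Google says it indexes
WEB_TO_INDEX_RATIO = 10        # "for fun" guess: the index covers 10% of the Web
BYTES_PER_PAGE = 100_000       # "100k", read as decimal kilobytes (assumption)

global_pages = INDEXED_PAGES * WEB_TO_INDEX_RATIO
global_bytes = global_pages * BYTES_PER_PAGE

print(f"{global_pages:,} pages")
print(f"{global_bytes:,} bytes")
print(f"{global_bytes // 10**12:,} terabytes")
print(f"{global_bytes / 10**15:.1f} petabytes")
```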

We've certainly come a long way since 1993, and we still have a long way to go before we get to yottabytes.

All this talk of big numbers reminds me of the excellent ACM Queue article by Peter Lyman and Hal Varian, How much storage is enough?, with oodles of nice soundbites (sound bytes?):

In 1999, the world produced about 1.5 exabytes of storable content (10^18 bytes). This is 1.5 billion gigabytes, and is equivalent to about 250 megabytes for every man, woman, and child on earth. Printed documents of all kinds make up only .003 percent of the total...

The tough part is digital content, though that's the most important component. According to our estimates, more than 90 percent of information content is born digital these days. Conversion to ASCII, MP3, MPEG, and other compression technologies dramatically reduces storage requirements by one to two orders of magnitude.

If all printed material published in the world each year were expressed in ASCII, it could be stored in less than 5 terabytes...

We expect that digital communications will be systematically archived in the near future and thus will contribute to the demand for storage. In 1999 we estimated e-mail to be about 12 petabytes per year, Usenet about 73 terabytes, and the static HTML Web about 20 terabytes. Many Web pages are generated on-the-fly from data in databases, so the total size of the "deep Web" is considerably larger.

They updated the survey in 2003 with the wonderful site How Much Information?:

The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.

Ninety-three percent of the information produced each year is stored in digital form. Hard drives in stand-alone PCs account for 55% of total storage shipped each year.

Over 80 billion photographs are taken every year, which would take over 400 petabytes to store, more than 80 million times the storage requirements for text.

One terabyte, the smallest practical measure for our project, is a million megabytes, which is equivalent to the textual content of a million books. An exabyte, which is what we use to report the final results, is a billion gigabytes.

  There are plenty more good nuggets in there -- see the sound bytes, charts, and summary.

Size of Searchable Internet

But reading that material got me wondering. "If all printed material published in the world each year were expressed in ASCII, it could be stored in less than 5 terabytes."  Can you imagine a day in the future when everything published in an entire year is shipped to anyone who wants it?  (Again, ignore copyrights.)  In less than 100 years will we have a hard drive small enough to carry around yet big enough to store everything anyone ever published? That's when Personal Webs -- not just the annotations of what I've read and written, but the annotations of people I trust -- can guide me to the probable answer I'm seeking by giving search engines tips about the kind of information I like.

I leave with a final thought.  How much stuff can I read in a lifetime?  (Not just web, but books, and magazines, and the backs of cereal boxes... how many bytes will that be?)

Back of the envelope calculation #4: How much do I personally read?

Assume I'm a whiz who speed reads, so I can read a page of text in a minute, and that a page of text is 250 words.  I'm college-fed, so assume that an average word in my kind of reading is 8 characters.  Then my average page is 2000 bytes of raw ASCII, or 2K.  How many hours a day do I spend reading?  Assume I'm an egghead who can multitask, so I average a solid 6 hours of reading per day, whether at work or at play or in meetings, or what have you. 6 x 60 = 360 minutes of reading per day.  That's 720k per day, which admittedly seems like an upper bound.  How long will I live?  Well, I have an unhealthy lifestyle, but living long runs in my family -- I have four grandparents who all passed 80, and three who passed 90 -- so assume I live 90 years, and chop off 10 because I was a slacker when I was young.  80 years of reading times 365.25 days per year is 29,220 days of reading before I kick the bucket.  And that's if I'm lucky.  But we're interested in establishing upper bounds here, so it'll do.  29,220 days times 720k per day is 21,038,400 kilobytes that I could read in a lifetime.  In other words, with my wildly conservative estimates an upper bound on the amount of text I could possibly read in a lifetime is 21 Gigabytes.
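Here is that lifetime estimate as a small Python sketch. The names are mine, and it assumes one byte per character of ASCII and decimal kilobytes and Gigabytes, as the figures above do.

```python
# Back-of-the-envelope #4: upper bound on text read in a lifetime.
WORDS_PER_PAGE = 250
CHARS_PER_WORD = 8         # one ASCII byte per character (assumption)
PAGES_PER_MINUTE = 1       # speed reader
READING_HOURS_PER_DAY = 6
READING_YEARS = 80         # 90-year life minus 10 slacker years
DAYS_PER_YEAR = 365.25

bytes_per_page = WORDS_PER_PAGE * CHARS_PER_WORD
minutes_per_day = READING_HOURS_PER_DAY * 60
bytes_per_day = minutes_per_day * PAGES_PER_MINUTE * bytes_per_page
reading_days = round(READING_YEARS * DAYS_PER_YEAR)
lifetime_bytes = reading_days * bytes_per_day

print(f"{bytes_per_day:,} bytes/day")
print(f"{reading_days:,} days of reading")
print(f"{lifetime_bytes / 1e9:.1f} Gigabytes in a lifetime")
```

Compare that with calculation #1: twelve years of web pages at 100k apiece already take up twice the raw text of a whole lifetime of reading, which says more about page weight than about how much I read.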

Personal Webs with recommender systems that take into account what I like to read and write, and what people I trust like to read and write, are the only way to make sure those 21 Gigabytes count.

Search Voyeur

Magellan's Search Voyeur back in the mid-1990's was a self-updating list of 20 randomly-selected, real-time search requests from the Magellan search engine. It was located at http://voyeur.mckinley.com/voyeur.cgi but now there's nothing there. (Magellan was bought by Excite, which released it as Excite Voyeur, but that was long ago in a galaxy far away.)

I've been a search voyeur watching what other people search for ever since the days of Magellan Search Voyeur. Which is why my eyes went "blink blink" when Cool Mel pointed me at Dogpile's SearchSpy:

It's not as cool as the projection display in the lobby of Googleplex, but you'll get the same feel from SearchSpy @ Dogpile (hint: unfiltered is the way to go). Or better yet, if you're a search junkie, you can try out this Search Engine Belt Buckle. Groovy!
I admit it, I'm a search junkie. And I love that Joi Ito pointed us at the search engine belt buckle.

Lookout, Software!

It's been four months since I wrote my Fisher-as-product-category post, and now I see that Microsoft has bought Lookout Software.

(As an aside, when I worked for Microsoft in the late-1990s, my team used to call the ever-crashing Outlook client, "Lookout!", which also happens to be the name of the punk rock record label in Berkeley where Green Day got their start... :)

Kristi Heim, SJ Mercury News, July 16, 2004 (the emphases below are mine):

Microsoft, the world's biggest software company, has thousands of programmers and billions of dollars to spend on developing search technology to compete with the likes of Google and Yahoo. But today, Microsoft is buying a key piece of search technology created by one guy in a Palo Alto guest house in his spare time, with help from his friend... Microsoft is acquiring Lookout, a two-man Silicon Valley company that offers a free, downloadable search engine for Microsoft's own Outlook e-mail software.

MSN Director Lisa Gurry said Microsoft sees Lookout as ``strong technology'' that fits in with Microsoft's long-term vision for search. Lookout puts a search toolbar into Outlook that is similar to a Google search box. It lets users enter words into the box to find information within e-mail messages or in files on the computer, working much faster than Outlook's built-in tools.

Gurry said Microsoft will roll Lookout's technology into a new MSN search service that lets people find information in e-mails, desktop computer files and other locations besides the Internet.

It's part of a recent $100 million push by the software giant to beef up its search offerings. Earlier this month, the company streamlined the MSN Search page and offered a preview of search technology it plans to launch later this year.

The Lookout search engine was initially written by Eric Hahn, a former chief technology officer at Netscape who runs venture capital firm Inventures Group.

Hahn said he wanted to get back into programming, so when he wasn't busy at his ``day job,'' he started dabbling with ways to make Outlook more searchable. Last spring, he made an early version called ``Chrome'' available on the Internet as a free download. Thousands of people tested it and offered suggestions.

As the number of users grew, he needed help managing the project, so he called on his friend and fellow Netscape veteran Mike Belshe.

Hahn and Belshe worked on the project in a mother-in-law cottage next to Hahn's house, launching an improved search engine called Lookout early this year.

They never had any formal contact with Microsoft until an executive from the Redmond, Wash., company called them for a meeting in Mountain View this spring, Hahn said.

The product works with Outlook so well that it has won many fans among Microsoft employees, Gurry said.

Belshe is joining Microsoft's MSN Search team. Hahn said he will advise Microsoft during the transition.

Hahn and Belshe, who were planning to commercialize Lookout, agreed to sell the company to Microsoft under one condition: Existing users will be allowed to keep using the technology even though Microsoft plans to stop offering downloads of Lookout as of today.

They estimated the number of Lookout users at less than 100,000.

``It's such a useful tool,'' Hahn said. ``As an independent company we can get it to so many millions, but Microsoft can get it to everybody. We really felt having the Microsoft machine behind us was a huge advantage.''

My "Rifkin's observations":
  1. Lookout is one of the most promising of fifty startups working on Fishers. (Other promising ones include X1, dtSearch, and Bloomba.) Microsoft gets a little edge against Yahoo and Google in going after this space as a result.

  2. The division of Microsoft most interested in this is not Longhorn, Outlook, or Exchange. It's MSN Search -- the folks who are going head-to-head with Yahoo and Google. The most significant thing I speculate from this fact is that everyone who matters in Microsoft believes that the search functionality of the first Longhorn (now due in 2006 at earliest, and 2007 at likeliest) will be substandard, so MSN Search is taking on the weighty task of going after Yahoo and Google. My own belief is that we should expect nothing better than the "Windows Search Dog" from Longhorn until at earliest second-generation Longhorn, or 2010.

  3. Microsoft is buying this company for one guy, Mike Belshe. My speculation is that they're purchasing him more for his expertise in parsing .pst and .ost files (and other esoteric mail file formats) than for any other developing expertise he might have. (If two guys outside Microsoft could code the Lookout software in under a year, couldn't Microsoft just throw a dozen internal folks at the problem? Sure, but they probably would have gotten steep resistance in turf wars from the Outlook, Exchange, and Longhorn folks to fork over file formats -- which I'm guessing Belshe and Hahn probably spent extensive time grappling with and/or reverse engineering.)

  4. It took two guys and less than 100,000 downloads to make a big splash in the Fisher product category space. If Lookout is like Altavista, then there is still an opportunity for a couple of guys to create "the Google of this space". I've had several friends go on job interviews with Google, and their internal product codenamed Puffin doesn't sound like it has any strategic advantage over others in this space the way PageRank represented marketing (and possibly technical) advantage over other Web search services in the late 1990s.
Here are some of my notes on Puffin:
  1. John Markoff, New York Times, May 18, 2004:
    Edging closer to a direct confrontation with Microsoft, Google, the Web search engine, is preparing to introduce a powerful file and text software search tool for locating information stored on personal computers...

    Improved technology for searching information stored on a PC will also be a crucial feature of Microsoft's long-delayed version of its Windows operating system called Longhorn. That version, which is not expected before 2006 at the earliest, will have a redesigned file system, making it possible to track and retrieve information in ways not currently possible with Windows software.

    Google's move is in part a defensive one, because the company is concerned about Microsoft's ability to make searching on the Web as well as on a PC a central part of its operating system. By integrating more search functions into Windows, Microsoft could conceivably challenge Google the way it threatened, and destroyed, an earlier rival, Netscape, by incorporating Web browsing into the Windows 98 operating system...

    Although Google's core business rests on huge farms of server computers that permit fast searching on the Internet, the company has already taken several steps to move beyond that business.

    Last year, Google began testing a free program called the Google Deskbar that makes it possible to search the Web by entering words and phrases in a small dialog box placed in the Windows desktop taskbar at the bottom of the computer screen.

    Google also sells a computer search system designed to index and retrieve information created and stored by a single organization.

    There is a rich history of less-than-successful attempts to create information search tools for personal computers. In the 1980's, for example, Mitchell Kapor's On Technology developed On Location for retrieving information on Macintosh computers and Bill Gross, a prominent software developer, led a group of programmers to create Lotus Magellan for the PC.

    Digital Equipment's Alta Vista search engine group also developed a search tool for data stored on desktop PC's. Today there are a number of commercial products for desktop searches like X1 and dtSearch. Moreover, both the Macintosh and Windows operating systems have file and text retrieval capabilities...

    The Google software project, which is code-named Puffin and which will be available as a free download from Google's Web site, has been running internally at the company for about a year.

    The project was started, in part, to prepare Google for competing with Windows Longhorn, which according to industry analysts will dispense with the need for a stand-alone browser.

    The disappearance of the Web browser and the integration of both Web search and PC search into the Windows operating system could potentially marginalize Google's search engine. Google, well aware of this threat, hired a Microsoft product manager last year to oversee the Puffin project as part of its strategy to compete with Microsoft's incursion into its territory.

    Microsoft has shown demonstrations of its new search technology, which emphasizes the use of natural language in queries like "Where are my vacation photos?" or "What is a firewall?" Microsoft believes that Longhorn users will no longer think about where information is stored; they will instead see a unified view of documents stored on both the Internet and on the desktop.

    The looming confrontation between Microsoft and Google is coming as Microsoft prepares to introduce its own advanced Web search service, possibly later this year. The company is revising its MSN strategy and backing away from its Internet dial-up service, looking instead to get more revenue from the search advertising market that Google dominates.

    Web and PC-based searching is a particularly thorny subject for Microsoft because the company's chairman, Bill Gates, first outlined the idea of "information at your fingertips" in a speech given at a computer industry trade show in 1990. Yet the company did little to innovate in the areas of Internet search or text and file searches on the PC until it discovered how profitable search had become for Google.

    Google's strategy is to move quickly while Microsoft is still developing its Longhorn version of Windows, adding programs and services like its recently announced Gmail electronic mail program. The intent, say people who are aware of the company's strategy, is to lower its vulnerability to Microsoft by adding businesses that are "sticky" - in other words, businesses that create strong customer loyalty or are hard to switch away from.

    Internet searching is widely seen by industry executives as a powerful commercial service, but one that is difficult to defend. It is widely presumed that Internet users who find a search service that is better than Google's will be willing to defect.

    Searches for information stored on a PC, however, could offer an advertising arena that is more readily defensible. Indeed, desktop searching might be particularly valuable for Google's commercial advertisers, which may be willing to pay dearly for the ability to place targeted ads in front of personal computer users.

    Such services, while they may be lucrative, will also inevitably force Google to deal with new controversies. Some privacy activists have opposed the Gmail service because they are concerned that the company is automatically extracting information from its customers' Gmail accounts.

  2. Stefanie Olsen, CNET, May 21, 2004
    Google is reportedly preparing to release downloadable software that enables people to search for text and files stored on their computer's hard drive. The move would dramatically expand Google's search business beyond the Web while taking direct aim at Microsoft, which is itself getting ready to take on Google's dominance in Web search with its own technology.

    Although Google would not confirm the existence of the project, called "Puffin," industry watchers have expected such a move for some time. Having announced plans last month for a $2.7 billion initial public offering of its stock, Google is accelerating efforts to increase revenue and expand into new markets on a number of fronts.

    By broadening into desktop file search, Google would put two businesses to the test. First, it would expand its Web-search advertising -- its primary source of revenue, with sales of $914 million last year -- to an ad-supported application running on the desktop. That would put Google much closer to controversial companies such as Claria (formerly Gator) and WhenU, which have been caught up in a growing consumer backlash against "adware" and "spyware" products.

    Second, Google would take what it's learned in building an enterprise search application and bring it to the masses. That's no easy task, considering that Google failed to storm the enterprise search market when it introduced the Google Search Appliance in September 2002. The product makes up a fraction of its business.

    The Microsoft factor

    But desktop file search poses vastly different problems than Web search does, and the company could easily be trumped by operating system makers such as Microsoft, whose Windows software runs on more than 90 percent of the world's PCs.

    Microsoft's OS dominance has been credited in the past with helping the software giant muscle into fresh territory by bundling new features in Windows--a key allegation in the U.S. Department of Justice's antitrust suit, filed against the company in October 1997.

    In a Securities and Exchange Commission filing announcing its IPO, Google flagged potential Microsoft tactics as a possible threat to its business on the Web. In an overview of risk factors facing the company, Google speculated that the software giant could one day seek to interfere with its ability to index certain kinds of documents on the Web.

    Such concerns are even more pertinent when it comes to the desktop, where Microsoft holds powerful levers to promote its own products over those of rivals.

    According to a report in The New York Times, Google will try to fulfill an unmet need among PC users for tools to easily find information across multiple applications on the hard drive -- searching through e-mail, text documents in various formats, music, and photo files, for example. Consumers would likely be the primary audience for such a tool, but it could easily infiltrate workplaces, too.

    Apple Computer already offers an elegant tool built into Mac OS X to perform many of these tasks, but it only works on its own Macintosh line of computers, which account for less than 5 percent of the market. Although Microsoft includes desktop search software as part of Windows, it is unwieldy, and most users rely instead on self-managed file folders to organize their archives.

  3. A Yahoo article that is no longer available (damn you, Yahoo, how cheap is storage and yet you delete articles less than two months old):
    "Microsoft is looking to protect the operating system and their control of the desktop. Everything Google does on the desktop is about protecting their Internet advertising," said David Thede, president of dtSearch Corp., which with Argo Technology powers the free Terra Lycos HotBot Desktop toolbar that allows users to search the Web, e-mail and PC files.

    "I have yet to see a real clash of interests between Microsoft and Google," Thede said...

    To effectively challenge Google on the Web search and advertising fronts, Microsoft would have to match Google's massive infrastructure that is widely believed to include more than 100,000 servers -- an investment analysts said it may or may not choose to make.

    Industry players said a move into desktop search would be an intelligent and natural extension of Google's business, but not without challenges -- chief among them being how to make money from the effort.

    Most providers of desktop search, including X1 and dtSearch, focus on corporate users who are used to paying around $100 to $300 for software. Scores of other software makers are in the business of providing tools to quickly locate information on PCs, and the landscape is littered with the corpses of companies that have failed.

    Google users are accustomed to getting free services in exchange for ads. While Web-search advertising has been a home run for Google, its Gmail product that delivers ads based on the content in e-mail has sparked a storm of protests from privacy advocates and may not be as lucrative.

  4. Stefanie Olsen, CNET, June 8, 2004:
    AltaVista, now owned by Yahoo, was among the first to take a stab at desktop search, but its product failed to catch on. Since then, a slew of companies have developed downloadable software applications to address the problem, including Copernic, Groxis, Enfish, 8020 and X1 Technologies. None have gathered critical mass.

    Research firm IDC has estimated that sales of software for search represented a $617 million market in 2003.

    "It's a tough market, lots of companies have come and gone," said Andrew Feit, a senior vice president of marketing for corporate search technology provider Verity.

    Although Google has mainly avoided controversy over its Web search ads, it runs the risk of alienating consumers if it misplays its hand in a downloadable application that aims to sort through private material, critics say.

    Adware companies such as Claria and WhenU are trotting out new desktop applications to appeal to consumers and support their ad businesses. Claria and WhenU began by bundling their advertising software with other popular file-sharing applications so they could increase the number of people they might track for ad purposes. These companies monitor people as they surf the Web and send targeted ads based on their behavior. The practices have landed them and many others in court, where they have argued for their right to deliver ads to the Web sites of their customers' rivals.

    In a sign of growing overlap between Web search advertising and ad-supported desktop tools, Yahoo's Overture subsidiary has struck a deal to display tiny text advertisements through Claria and WhenU.

    State and federal governments are now interested in regulating and perhaps even banning adware and its more controversial cousin, spyware. Utah has already enacted such a law, and the U.S. House of Representatives and the Federal Trade Commission have convened hearings on the issue in the last few weeks.

    Google may be backing self-regulation in advance of widespread laws. This week, the company released a set of suggested principles for software makers to follow when writing programs that embed themselves on Internet users' PCs. The guidelines propose that an application should follow simple rules of politeness: It should admit what it's doing, permit itself to be disabled and not do sneaky things like leak personal information.

    Yet even if it applies such best practices, Google could still land in hot water. Given that the company already has access to information about people's search histories and Web surfing behavior and will do so about their e-mail communications through its upcoming Gmail service, Google could take heat from privacy advocates and consumers.

    The company already makes the Google Toolbar, Deskbar and other products for Windows that transmit some information about Web surfing behavior back to its servers. Under proposed laws, these tools could be regulated, as would its upcoming ad-supported desktop search software.

    "What's happened is that there's a trend of going from search to publishers to the desktop. After looking at the beginning of that market with Claria, the question is: How do you make it a consumer experience that they not only want, but also aren't offended by?" Highland's DeSilva said.

    Those concerns over embedded software are unlikely to affect Microsoft, whose upcoming integrated search tools will probably be kept free from advertising.

    Software challenges

    Google also faces considerable hurdles in the technology side of desktop search.

    "So many people equate search with Google, but in fact, there's an entirely different market for enterprise search software. And it is a complex problem to solve," said Sue Feldman, a vice president of content technologies research for IDC.

    Google introduced an application for searching corporate intranets and desktop files two years ago. But the software makes up less than 5 percent of the company's business, or less than $48 million last year, according to the company's IPO filing. While Google has a couple hundred enterprise customers, it hasn't been as successful in that sector as it has in search and advertising.

    Google has become popular because it's helped to improve Web search by delivering fast, relevant results. But its formulas for the Web that rely on the link structure of Web pages are unlikely to translate well to the PC environment, as files and documents on the PC don't contain an inherent link structure.

    One answer is to embed a common "sticky" note to applications and documents that would let people label these with a few keywords. That would make it easier to retrieve the files down the road. Application makers such as Adobe Systems and OS makers such as Microsoft are in a prime position to develop such tools.

    Another approach, now under development by Microsoft, is to create intelligent documents with XML (Extensible Markup Language) links. These would enable people to input information into one document and funnel that data to other, relevant applications. Search tools would be built in, so related information could be found in disparate applications.

    Autonomy, Convera and Verity are all companies that are working to solve these enterprise search problems and typically offer much more robust technology than Google's enterprise technology. Google's system tends to focus on simplicity and works particularly well with HTML-based documents.

    "Google's real challenge will be in adoption: getting people to download and install it," independent analyst Matthew Berk said. "In order to search your hard drive, you need to install something that's pretty intrusive, that can reach deep down into your machine."

A friend of mine, Kevin Compton, believes that Searching is the programmable platform of the next decade, the way that app servers and web servers were the platform of the last decade, and operating systems were the platform of the decade before that. I'm inclined to believe him. In that context, the Google/Microsoft/Yahoo showdown represents no less than a land grab for the hugest development platform to date in the software industry.

Fisher As A Product Category

I've been thinking a lot lately about Personal Search Engines -- and about Fisher as a category of desktop software that indexes your email (and, as the product evolves, the content in your files, most of which have gotten to your hard drive through email, but also some that arrived through the Web, RSS feeds, instant messaging, and so on). I'm not the only one.

Three days ago, David Weinberger posted an enthusiastic endorsement of X1: "I use it maybe 5 times a day. Now X1 is starting to market itself. Good. It's worth the $100 in time savings alone. It's held up well as my email archive has grown to 110,000 messages." Actually, I don't see a marketing campaign. A scan of Google News for X1 reveals just one press release, which isn't even on X1's press page. However, I have been following X1 for about a year now, and I did download and try their product 3 months ago. And I can see clearly the vision in Nate Koechley's comment:

It's wonderful, and will change how you think about your information. Gone are the days of extensive folder structures in Outlook (or your client of choice). Now, it doesn't matter where the message is, you can always find what you want in the same 2 seconds.
This is clearly a compelling vision: a Fisher has the potential to change my life. It is the starting interface I go to for all of my personal information (just as Google is the starting interface I go to for all public information), and it saves me the time I would have otherwise spent organizing email & files and waiting for search queries to complete.

Two days ago, John Battelle cited David's post in his Searchblog, and cites several good reasons why Fisher-as-a-Product-Category solves a problem that a lot of people have, and that gets worse each day:

Desktop search (ie searching your own hard drive) is one of those things that seems to have gotten worse in the past ten years ... I've got 40 gigs, I think, but no desktop search utility (Sherlock doesn't have text string search, far as I can tell). My email, for example, is a thicket of badly organized folders.
X1 currently runs only on Windows; John asks if there are any such products available on Mac OSX. The only first-generation Fisher that I can think of for OSX is the beta-grade open source project called Zoë. In reading the comments to John's post, I see that a few people suggested Zoë for searching personal email archives. As with X1, I believe Zoë is good -- good enough to criticize, if not use daily.

First, let me note with due regard to my friend Raphaël Szwarc (Zoë's original visionary and author) and fellow Caltech alum Bill Gross (X1's original visionary) that "good enough to criticize" is actually fairly high praise (as Alan Kay famously noted). There are a lot of technologies that do not qualify as Fishers because they are too raw to be useful to end users: toolkits like Doug Cutting's AIAT (née V-Twin), Lucene, and Nutch, for example, are not 'complete end user products'; the current Windows XP "Search For Files Or Folders" Dog is silly and slow; the current Outlook 2003 Email Search facility is serious and slow; and grep gets stuck in the throat of anyone who's not a UNIX power user.

Back to X1 and Zoë. In my experience so far -- and remember, I'm a power email user with 20 Gig of email saved over the last 15 years -- neither of these products is good enough to put up with on a regular basis yet.

I believe those who suggested Zoë have never tried actually using it. Rohit Khare and I tried it on a Mac, and we found the system to be a good demonstration of the idea but not something we would actually use as the primary interface to all the information privately available on our desktop (the way we use Google a hundred times a day as the primary interface to all the information publicly available on the Web). The blog-like UI, which seems interesting and novel at first, gets real old real fast. More importantly, as mailbox size increases, Zoë becomes unbearable to use for search because:

(1) Its search results are not complete -- there appears to be an arbitrary cutoff of the total number of items returned;

(2) The search results themselves are sufficiently indistinguishable from a random order -- either Zoë is using Lucene poorly, or Lucene works poorly for this application, because even the results in most-recent-order-first would be an improvement;

(3) Searching with Zoë for any mailbox bigger than a few Megabytes is slow -- slow enough that it doesn't transcend Heidegger's categories from 'present-at-hand' to 'ready-to-hand' (like, say, Google itself); and

(4) No simple query syntax like Google offers: not just booleans, but operators like to:adam or subject:cheese or attach:ppt.
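
To make point (4) concrete, here is a minimal sketch of what parsing such a query might look like. The operator names (to, from, subject, attach) come from the examples above; the message layout (a dict with "to", "subject", "attach", and "body" keys) and the plain substring matching are assumptions for illustration, not how Google or any of these products actually works.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical operator set, based on the examples in the post.
KNOWN_FIELDS = {"to", "from", "subject", "attach"}

# field:value, or a "quoted phrase", or a bare word.
TOKEN = re.compile(r'(\w+):(\S+)|"([^"]+)"|(\S+)')


def parse_query(query: str) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split a query into field filters and free-text terms."""
    fields: Dict[str, List[str]] = {}
    terms: List[str] = []
    for field, value, phrase, word in TOKEN.findall(query):
        if field and field.lower() in KNOWN_FIELDS:
            fields.setdefault(field.lower(), []).append(value.lower())
        elif field:
            # Unknown prefix (a URL, say): keep the whole token as plain text.
            terms.append(f"{field}:{value}".lower())
        elif phrase:
            terms.append(phrase.lower())
        else:
            terms.append(word.lower())
    return fields, terms


def matches(msg: Dict[str, str], fields: Dict[str, List[str]], terms: List[str]) -> bool:
    """True if msg satisfies every field filter and contains every term."""
    for field, values in fields.items():
        haystack = msg.get(field, "").lower()
        if not all(v in haystack for v in values):
            return False
    text = (msg.get("subject", "") + " " + msg.get("body", "")).lower()
    return all(t in text for t in terms)
```

A query such as `to:adam subject:cheese attach:ppt lunch` would then filter on three fields and require the word "lunch" somewhere in the subject or body. A real Fisher would run this against an index rather than scanning every message.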

Evaluating X1 on Windows reminded me of my evaluation of Bloomba. They stake their claims to fame on speed (that is, until my email gets to be greater than 100k messages, but apparently this is a problem very few of us have right now). However, their UI's are clunky fat clients that demand I change some aspect of how I work. With X1, I have to give up a half-inch of my screen for an unnecessarily-modal four-tab UI for searching either email text, attachments, files, or contacts; with Bloomba, I have to drop Outlook (or Eudora or Netscape), with all of the charming foibles that I've gotten used to in my mail client over the years.

In either case, there isn't a decent query syntax -- heck, X1 is no more than 1970's-era KWIC (keyword-in-context) string matching with a bizarre insistence on keystroke-by-keystroke redrawing, as if it's trying to nag me into acknowledging its "speed". Without the kind of excerpting that Google does in step "K", I find myself wading through dozens if not hundreds of hits for most queries... which brings me to a more important point: There is no ranking that makes the results better than grep. At least (unlike Zoë) X1 and Bloomba return matches in order of most-recent-first, but for quality search results, Google has proven to me that ranking is of utmost importance. There is no ranking algorithm a la PageRank that acknowledges even the simplest truths about my mail (stuff from Rohit ranks higher than Orkut notifications, say :).
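
For illustration, here is a rough sketch of the two things I'm asking for: a short excerpt around the matching keyword, and an ordering that puts trusted senders ahead of notification noise before falling back to most-recent-first. The hit layout (dicts with "sender" and "timestamp" keys) and the hand-assigned sender weights are assumptions made up for this example; it is not how X1, Bloomba, or Google rank anything.

```python
from typing import Dict, List


def excerpt(text: str, term: str, width: int = 30) -> str:
    """Return the first occurrence of term with `width` characters of context."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


def rank(hits: List[dict], sender_weight: Dict[str, float]) -> List[dict]:
    """Order hits by sender weight first, then most recent first."""
    return sorted(
        hits,
        key=lambda h: (sender_weight.get(h["sender"], 0.0), h["timestamp"]),
        reverse=True,
    )
```

With `sender_weight = {"rohit": 1.0}`, an older message from Rohit sorts ahead of a newer Orkut notification, which is exactly the "simplest truth" that pure recency ordering misses.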

Many folks can't even imagine 100K messages, but I'm closer to a million (!). Sounds absurd, sure, but the design target for Microsoft Longhorn is supposedly 1 terabyte PCs! And that speaks to another basic criticism of all the aforementioned tools: email may be the center of my universe, but not the entirety of it. How about an "image search" of my hard drive that didn't require me to laboriously pre-caption each photo? Or a "version search" of our latest spec sheet that doesn't trip up on the fact that there are 32 separate Word attachments that all contain the same paragraph over the last year? Or a way to search all the web pages I've visited before? There's hard drive to spare -- why not cache everything?
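
As a sketch of the "version search" idea, one simple approach is to fingerprint each paragraph and group the documents that share one, so 32 near-identical attachments collapse into a single result. This assumes the text has already been extracted from the Word files (that step is not shown), and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterator, List


def paragraph_keys(text: str) -> Iterator[str]:
    """Yield a hash for each non-empty paragraph, ignoring case and spacing."""
    for para in text.split("\n\n"):
        normalized = " ".join(para.lower().split())
        if normalized:
            yield hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shared_paragraphs(docs: Dict[str, str]) -> List[List[str]]:
    """Return groups of document names that share at least one paragraph."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name, text in docs.items():
        for key in set(paragraph_keys(text)):
            index[key].append(name)
    groups = {tuple(sorted(names)) for names in index.values() if len(names) > 1}
    return [list(group) for group in sorted(groups)]
```

Exact hashing only catches paragraphs that are identical after normalizing case and whitespace; catching lightly edited versions would need a fuzzier comparison.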

To put it bluntly, Zoë and X1 are good first-generation manifestations of Fisher As A Product Category. But this product category must evolve before these products are a must-have for everyone who feels the pain of finding private information in-my-email-and-on-my-hard-drive:

In Summary
Google has shown me I can have it all: fast, ranked search with a simple UI and a rich query language. Is it too much to ask for being able to have that kind of search for my personal data the way I can already search the public web?

By the by, several other folks in the ensuing discussion linked to projects I don't consider "good enough to criticize" (yet):

Jwz's Intertwingle insight is still a manifesto with which I wholeheartedly agree, but somehow it hasn't magically been implemented spontaneously by the open source community. Perhaps Chandler will do better -- it has great architecture plans -- but in my opinion its attention is too focused on competing with Outlook-the-GUI.

Launchbar is pretty much like locate(8) for non-UNIX types -- I love the single-keystroke access to its ultra-minimal UI, but I'm not a developer enough to be excited about searching .h files for method names. I need to search restaurant names in unstructured text...

And Spring and Scopeware, for all their promise, have even clunkier fat-client UI's than X1 and Bloomba. I don't need any more User Interfaces in my life!

Fisher

(The following post is cut-n-pasted from an email conversation I had recently about a Personal Search Engine with Jeff Barr. I'm posting it on my typepad so that I can use Google to find these notes again. I note the irony that if I had a Fisher, I wouldn't need to blog it publicly to be able to find it with certainty later when I need to find it.)

I've been thinking long and hard lately about finding things in my (20 Gigabytes representing 15 years of) email archives (or, as Rohit calls it, The "Search My Mail D*mnit" Problem) so I can properly deal with Ham. In the email must evolve context, I think ZOË-as-Personal-Server points in the right direction.

I find myself wondering if there is still an opportunity to launch a desktop search product that fits the classic definition of platform. The equivalent of a "Browser" for the next decade that brings together existing disparate tools by mixing SMTP and HTTP and throws in a healthy dose of instant messaging and RSS -- except that instead of browsing for information it lets you go (for lack of a better word) "Fish" for information. It's got a simple browser interface and query language (like Google), is lightning fast (due to regular re-indexing), and offers search results of your personal stuff in that simple UI.

Rohit is three steps ahead of me here -- that any good "Fisher" of all your emails, IM's, desktop files, web history, and RSS feeds needs a great algorithm to rank the results of your "Go Fish" queries. (Ranking is something ZOË doesn't do, and therefore it cannot handle the volumes of email I receive daily.)

A "Fisher" isn't a replacement for our existing PIM's and browsers and IM clients and RSS readers, in the same way that Google doesn't replace the Web. Actually, Google is a good analogy since it provides a set of ranked results for any given query of the Public Web. But Google is an anonymous search of the Public Web of information. Fisher, by contrast, provides a set of ranked results for any given query with a personal search of your Private Web of information. It's a customized search of your personal stuff.

I don't have a better sketch of the opportunity yet, but if it works for your personal stuff anything close to the way Google works for public stuff, I think it's a killerApp just waiting for an auteur to get around to writing it. The platform play of course is that it is scriptable -- not just the query engine but the "tap on your ethernet" which can watch HTTP and SMTP traffic crossing your machine and do things on your behalf based on inspecting all the data that crosses through.

I have to think about it. Like I said, I'm sure Rohit is three steps ahead of me on this one... as was jwz six years ago in his write-up of Intertwingle (and short discussion), "a potential project to make it easier to deal with a massive volume of personal messages: excavating, traversing, relating, reporting, annotating. Intertwingle can be seen as a unification of a search tool and an address book. It is not, however, a mail reader. The presentation of query results could be done through a mail reader, but the intention is that one's choice of mail reader should be orthogonal to the use of this tool. The two kinds of tools just happen to operate on the same data." Might this be what the Twingle effort of Kasei is all about?

Update, 3/4/2004 at 5am. Rohit emailed me a description of Fisher in his own words:

Personally, I'm getting very aggravated by the irony that it's easier to find stuff on the Internet than on my own PC. The vast majority of this problem is email, specifically. And while email is impoverished in hyperlinks, ruling out the sorts of PageRank algorithms web search engines use, it is very rich in social network information. The correspondence information can help us choose which bits of text are likely to be the most relevant hit for a query, because it does matter who said what.

Admittedly, this may have something to contribute about searching multiple-agent discourses in general -- anywhere you can clearly identify authorship of a snippet, and thence calibrate which authors a user reads most.

Of course, if we simply ranked people by frequency of interaction, it would be kind of boring. Frame it as a simple principal-eigenvector problem -- count who reads your readers, and so on -- and interesting patterns emerge. It could well be as useful as PageRank itself was, by comparison to text search alone.

Imagine the Google UI for your own PC. You aim your web browser at localhost and get back a results page that looks eerily familiar, but the hits are actually documents, mails, photos, and cached web pages from your own personal archive. This means nailing the challenges of grouping similar results, generating short excerpts, converting file formats for indexing, and so on. There's more magic to how it installs -- you don't change your email client at all! -- but that's just more technology.
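
Rohit's "principal-eigenvector" framing can be sketched with a few lines of power iteration. To be clear about the assumptions: the input format (a dict mapping each person to the list of people whose mail they read), the damping factor, and the handling of people who read no one are all borrowed from the usual PageRank recipe, not from anything Rohit specified.

```python
from typing import Dict, List


def correspondent_rank(
    reads: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Damped power iteration over a 'who reads whom' graph.

    reads[a] lists the people whose mail `a` reads (an assumed input).
    Returns a score per person; scores sum to 1.
    """
    people = sorted(set(reads) | {p for targets in reads.values() for p in targets})
    if not people:
        return {}
    n = len(people)
    score = {p: 1.0 / n for p in people}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in people}
        for reader in people:
            targets = reads.get(reader, [])
            if targets:
                # A reader passes their own weight on to the authors they read.
                share = damping * score[reader] / len(targets)
                for author in targets:
                    new[author] += share
            else:
                # Someone who reads no one: spread their weight evenly.
                for p in people:
                    new[p] += damping * score[reader] / n
        score = new
    return score
```

Given `{"me": ["rohit", "orkut"], "rohit": ["me"], "jeff": ["rohit"], "orkut": []}`, Rohit ends up scoring higher than Orkut, because he is read by more people and by people who are themselves read. Those scores could then feed the sender weights used to order search hits.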

I believe Intertwingle is a good early manifesto about Fishers as a product category, and I think that ZOË and X1 are good first-generation instantiations of Fisher-as-a-product. Looking around SearchTools.Com I don't see any other Fishers... yet.

See also: Fisher as a product category, Lookout, Software!, Fluffy Bunny (aka Google Desktop).

Search BaRF

Tim Bray has been thinking a lot about search lately. He's looking for his next gig, preferably one that will let him work on his Basic Resource Finder (BaRF) vision:

BRF is not going to be built to try to replace Google or anything close to that; which means that there will be no machinery for index partitioning nor massive-scale parallelism. The reason is that I think that the brute-force application of an ordinary server with a lot of RAM ought to be able to provide all the search muscle just about any imaginable enterprise-scale search problem needs. The Web, that’s a different class of problem, and one that only needs to be solved once or twice, and already has been, and the solution isn’t cheap.
It doesn't take a marketing genius to note that BaRF is a really bad name for a vision. But it's a really compelling vision, one I applaud and look forward to seeing become reality.

Oodles Of Bytes

Ross Mayfield: "The latest Berkeley study on the growth of information shows we produced five exabytes during 2002... It's growing at 30% per year... 'We are getting swamped, and we need better ways to organize and manage information. Hopefully, information technology will never replace smart thinking and the human analytical thinking'... Here's something that keeps the Net in perspective, passed on from a friend. Netflix ships 1,500 terabytes worth of information a day. Andrew Odlyzko suggests that daily traffic flow on the Net is 2,000 terabytes. Only 1/3 more than a single CD-ROM rental company."

Reading

  • John Battelle: The Search

    My favorite book of 2005. Period.


    (*****)

  • Steven D. Levitt: Freakonomics : A Rogue Economist Explores the Hidden Side of Everything

    "Just because two things are correlated does not mean that one causes the other. A correlation simply means that a relationship exists between two factors -- let's call them X and Y -- but it tells you nothing about the direction of that relationship. It's possible that X causes Y; it's also possible that Y causes X; and it may be that X and Y are both being caused by some other factor, Z.

    Economics is, at root, the study of incentives: how people get what they want, or need, especially when other people want or need the same thing.

    Incentives are the cornerstone of modern life. The conventional wisdom is often wrong. Dramatic effects often have distant, even subtle, causes. Experts use their informational advantage to serve their own agenda. Knowing what to measure and how to measure it makes a complicated world much less so." (*****)

  • Malcolm Gladwell: Blink

    A book of anecdotes about the power of thinking without thinking; this book is a more interesting read than Gladwell's previous, The Tipping Point.

    New York Times: "Gottman believes that each relationship has a DNA, or an essential nature. It's possible to take a very thin slice of that relationship, grasp its fundamental pattern and make a decent prediction of its destiny. Gladwell says we are thin-slicing all the time -- when we go on a date, meet a prospective employee, judge any situation. We take a small portion of a person or problem and extrapolate amazingly well about the whole."

    David Brooks, who wrote that review, adds: "Isn't it as possible that the backstage part of the brain might be more like a personality, some unique and nontechnological essence that cannot be adequately generalized about by scientists in white coats with clipboards?" (*****)

  • Paul Graham: Hackers and Painters

    I don't agree with some parts of this book, but I truly loved reading it, and it really made me think. I referenced it in my weblications and superhacker and phoneboy posts. Favorite chapter is How to Make Wealth. (Thanks, Ev.) (*****)

  • Joel Spolsky: Joel on Software

    Joel is really good at wielding "diverse and occasionally related matters of interest to software developers, designers, and managers, and those who, whether by good fortune or ill luck, work with them in some capacity."

    Joel on Software embodies the principle of "Welcome to management! Guess what? Managing software projects has nothing at all to do with programming." This book, a compendium of the website's wisdom, is useful for everyone from team leads estimating schedules to software CEOs developing competitive strategy. (*****)

  • Bruce Sterling: Tomorrow Now: Envisioning The Next Fifty Years

    Bruce wrote this book to come to terms with seven novel aspects of the twenty-first century, situations that are novel to that epoch and no other. It's about future possibilities.

    "This is the future as it is felt and understood: via human experience... The years to come are not merely imaginary. They are history that hasn't happened yet. People will be born into these coming years, grow to maturity in them, struggle with their issues, personify those years, and bear them in their flesh. The future will be lived." Hear, hear, well-spoken, Bruce. (*****)

  • The World's 20 Greatest Unsolved Problems: John Vacca

    "Science has extended life, conquered disease, and offered new sexual and commercial freedoms through its rituals of discovery, but many unsolved problems remain...

    If support for science falters and if the American public loses interest in it, such apathy may foster an age in which scientific elites ignore the public will and global imperatives." (*****)

  • Paul Hawken, Amory Lovins, L. Hunter Lovins : Natural Capitalism: Creating the Next Industrial Revolution

    I had the pleasure recently of meeting Amory Lovins and hearing him talk about Twenty Hydrogen Myths and the design of hypercar. (He also talked about Bonobos... wow.) I'm a convert to the way of thinking espoused in Natural Capitalism. I used to be cynical about the future, but Amory's work has made me a believer that many great things are about to come. The best way to predict the future is to invent it. (*****)

  • Merrill R. Chapman: In Search of Stupidity: Over 20 Years of High-Tech Marketing Disasters

    In hilarious prose, this book catalogs lots of stoopid high-tech marketing decisions. It offers clear, detailed analysis of many a marketing mishap, with what happened, why, and how to avoid such stupidity. Might just be the best. book. ever... (*****)

  • Paul Krugman: The Great Unraveling: Losing Our Way in the New Century

    A book exposing the pitfalls of crony capitalism, from corrupt corporations straight up to the executive branch of our government. Krugman is nonpartisan -- what he exposes is foolish short-term thinking on the part of recent United States policies. The patriotic thing to do, he advises, is to fix these economic problems now before they become much harder to solve.

  • Henry Petroski: Small Things Considered: Why There Is No Perfect Design

    "Design can be easy and difficult at the same time, but in the end, it is mostly difficult." (*****)

  • Alexander Blakely: Siberia Bound

    One of my favorite books of the past few years. Xander is a master storyteller. (*****)

  • Susan Scott: Fierce Conversations

    How to make every conversation count. One of my favorite books of the last decade. (*****)

Blog powered by TypePad
Member since 08/2003