I've been thinking a lot lately about Personal Search Engines -- and about Fisher as a category of desktop software that indexes your email (and, as the product evolves, the content in your files, most of which have gotten to your hard drive through email, but also some that arrived through the Web, RSS feeds, instant messaging, and so on). I'm not the only one.
Three days ago, David Weinberger posted an enthusiastic endorsement of X1: "I use it maybe 5 times a day. Now X1 is starting to market itself. Good. It's worth the $100 in time savings alone. It's held up well as my email archive has grown to 110,000 messages." Actually, I don't see a marketing campaign. A scan of Google News for X1 reveals just one press release, which isn't even on X1's press page. However, I have been following X1 for about a year now, and I did download and try their product 3 months ago. And I can see clearly the vision in Nate Koechley's comment:
It's wonderful, and will change how you think about your information. Gone are the days of extensive folder structures in Outlook (or your client of choice). Now, it doesn't matter where the message is, you can always find what you want in the same 2 seconds.
This is clearly a compelling vision: a Fisher has the potential to change my life. It is the starting interface I go to for all of my personal information (just as Google is the starting interface I go to for all public information), and it saves me the time I would have otherwise spent organizing email & files and waiting for search queries to complete.
Two days ago, John Battelle cited David's post in his Searchblog and gave several good reasons why Fisher-as-a-Product-Category solves a problem that a lot of people have, and that gets worse each day:
Desktop search (ie searching your own hard drive) is one of those things that seems to have gotten worse in the past ten years ... I've got 40 gigs, I think, but no desktop search utility (Sherlock doesn't have text string search, far as I can tell). My email, for example, is a thicket of badly organized folders.
X1 currently runs only on Windows; John asks if there are any such products available on Mac OSX. The only first-generation Fisher that I can think of for OSX is the beta-grade open source project called Zoë. In reading the comments to John's post, I see that a few people suggested Zoë for searching personal email archives. I believe Zoë, like X1, is good -- good enough to criticize, if not to use daily.
First, let me note with due regard to my friend Raphaël Szwarc (Zoë's original visionary and author) and fellow Caltech alum Bill Gross (X1's original visionary) that "good enough to criticize" is actually fairly high praise (as Alan Kay famously noted). There are a lot of technologies that do not qualify as Fishers because they are too raw to be useful to end users: toolkits like Doug Cutting's AIAT (née V-Twin), Lucene, and Nutch, for example, are not 'complete end user products'; the current Windows XP "Search For Files Or Folders" Dog is silly and slow; the current Outlook 2003 Email Search facility is serious and slow; and grep gets stuck in the throat of anyone who's not a UNIX power user.
Back to X1 and Zoë. In my experience so far -- and remember, I'm a power email user with 20 Gig of email saved over the last 15 years -- neither of these products is good enough to put up with on a regular basis yet.
I believe those who suggested Zoë have never tried actually using it. Rohit Khare and I tried it on a Mac, and we found the system to be a good demonstration of the idea but not something we would actually use as the primary interface to all the information privately available on our desktop (the way we use Google a hundred times a day as the primary interface to all the information publicly available on the Web). The blog-like UI, which seems interesting and novel at first, gets real old real fast. More importantly, as mailbox size increases, Zoë becomes unbearable to use for search because:
(1) Its search results are not complete -- there appears to be an arbitrary cutoff of the total number of items returned;
(2) The search results themselves are sufficiently indistinguishable from a random order -- either Zoë is using Lucene poorly, or Lucene works poorly for this application, because even the results in most-recent-order-first would be an improvement;
(3) Searching with Zoë for any mailbox bigger than a few Megabytes is slow -- slow enough that it doesn't transcend Heidegger's categories from 'present-at-hand' to 'ready-to-hand' (like, say, Google itself); and
(4) No simple query syntax like Google offers: not just booleans, but operators like to:adam or subject:cheese or attach:ppt.
Evaluating X1 on Windows reminded me of my evaluation of Bloomba. They stake their claims to fame on speed (that is, until my email gets to be greater than 100k messages, but apparently this is a problem very few of us have right now). However, their UI's are clunky fat clients that demand I change some aspect of how I work. With X1, I have to give up a half-inch of my screen for an unnecessarily-modal four-tab UI for searching either email text, attachments, files, or contacts; with Bloomba, I have to drop Outlook (or Eudora or Netscape), with all of the charming foibles that I've gotten used to in my mail client over the years.
In either case, there isn't a decent query syntax -- heck, X1 is no more than 1970's-era KWIC (keyword-in-context) string matching with a bizarre insistence on keystroke-by-keystroke redrawing, as if it's trying to nag me into acknowledging its "speed". Without the kind of excerpting that Google does in step "K", I find myself wading through dozens if not hundreds of hits for most queries... which brings me to a more important point: There is no ranking that makes the results better than grep. At least (unlike Zoë) X1 and Bloomba return matches in order of most-recent-first, but for quality search results, Google has proven to me that ranking is of utmost importance. There is no ranking algorithm a la PageRank that acknowledges even the simplest truths about my mail (stuff from Rohit ranks higher than Orkut notifications, say :).
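To make those two wishes concrete, here is a minimal Python sketch of field operators (to:, from:, subject:, attach:) combined with a ranking that is not purely chronological. Everything in it is an assumption made for illustration: the Message record, the SENDER_WEIGHT table, and the 30-day recency decay are hypothetical, and none of this is how X1, Bloomba, Zoë, or Google actually work.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    # Hypothetical in-memory record; a real Fisher would read these from an index.
    sender: str
    to: str
    subject: str
    body: str
    attachments: tuple
    date: datetime


# Assumed, hand-tuned weights: mail from a close colleague counts for more
# than automated notifications. Senders not listed here get weight 1.0.
SENDER_WEIGHT = {"rohit": 3.0, "orkut": 0.1}

FIELDS = {"to", "from", "subject", "attach"}


def parse_query(query):
    """Split 'to:adam subject:cheese lunch' into field filters and free terms."""
    filters, terms = [], []
    for token in query.lower().split():
        field, sep, value = token.partition(":")
        if sep and field in FIELDS and value:
            filters.append((field, value))
        else:
            terms.append(token)
    return filters, terms


def matches(msg, filters, terms):
    """True if every field filter and every free term is found in the message."""
    haystack = {
        "to": msg.to.lower(),
        "from": msg.sender.lower(),
        "subject": msg.subject.lower(),
        "attach": " ".join(msg.attachments).lower(),
    }
    text = (msg.subject + " " + msg.body).lower()
    return (all(value in haystack[field] for field, value in filters)
            and all(term in text for term in terms))


def score(msg, terms, now):
    """Sender weight times term frequency times a gentle recency decay."""
    text = (msg.subject + " " + msg.body).lower()
    term_hits = sum(text.count(term) for term in terms)
    age_days = max((now - msg.date).days, 0)
    recency = 1.0 / (1.0 + age_days / 30.0)
    weight = 1.0
    for name, value in SENDER_WEIGHT.items():
        if name in msg.sender.lower():
            weight = value
    return weight * (1.0 + term_hits) * recency


def search(messages, query, now):
    """Filter by the parsed query, then return hits best-first."""
    filters, terms = parse_query(query)
    hits = [m for m in messages if matches(m, filters, terms)]
    return sorted(hits, key=lambda m: score(m, terms, now), reverse=True)
```

With this in place, a call such as search(inbox, "from:rohit attach:ppt cheese", datetime.now()) would narrow by sender and attachment name before ranking. The scoring formula is only one hypothesis among many; the point is that even a crude sender weight lets a month-old note from Rohit outrank this morning's Orkut notification.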
Many folks can't even imagine 100K messages, but I'm closer to a million (!). Sounds absurd, sure, but the design target for Microsoft Longhorn is supposedly 1 terabyte PCs! And that speaks to another basic criticism of all the aforementioned tools: email may be the center of my universe, but not the entirety of it. How about an "image search" of my hard drive that didn't require me to laboriously pre-caption each photo? Or a "version search" of our latest spec sheet that doesn't trip up on the fact that there are 32 separate Word attachments that all contain the same paragraph over the last year? Or a way to search all the web pages I've visited before? There's hard drive to spare -- why not cache everything?
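The "version search" wish can be sketched in a few lines of Python, assuming the text has already been extracted from each Word attachment (that extraction step is not shown). The function names and the (filename, timestamp, paragraph) tuple layout are hypothetical, and the approach only catches paragraphs that are identical after whitespace and case normalization; near-duplicates would need something fuzzier.

```python
import hashlib
import re


def paragraph_key(paragraph):
    """Hash a paragraph after normalizing whitespace and case."""
    normalized = re.sub(r"\s+", " ", paragraph).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def collapse_versions(hits):
    """Collapse hits that share the same matching paragraph.

    hits: iterable of (filename, timestamp, matching_paragraph).
    Returns one (filename, timestamp, copies) entry per distinct paragraph,
    naming the newest file that contains it and how many copies were found.
    """
    groups = {}
    for filename, timestamp, paragraph in hits:
        key = paragraph_key(paragraph)
        newest, count = groups.get(key, (None, 0))
        if newest is None or timestamp > newest[1]:
            newest = (filename, timestamp)
        groups[key] = (newest, count + 1)
    return [(name, ts, count) for (name, ts), count in groups.values()]
```

Fed the 32 attachments from the example above, this would return a single result pointing at the most recent copy, with a count of 32, instead of 32 indistinguishable hits.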
To put it bluntly, Zoë and X1 are good first-generation manifestations of Fisher As A Product Category. But this product category must evolve before these products are a must-have for everyone who feels the pain of finding private information in-my-email-and-on-my-hard-drive:
In Summary
Google has shown me I can have it all: fast, ranked search with a simple UI and a rich query language. Is it too much to ask to be able to search my personal data the way I can already search the public web?
By the by, several other folks in the ensuing discussion linked to projects I don't consider "good enough to criticize" (yet):
Jwz's Intertwingle insight is still a manifesto with which I wholeheartedly agree, but somehow it hasn't magically been implemented spontaneously by the open source community. Perhaps Chandler will do better -- it has great architecture plans -- but in my opinion its attention is too focused on competing with Outlook-the-GUI.
Launchbar is pretty much like locate(8) for non-UNIX types -- I love the single-keystroke access to its ultra-minimal UI, but I'm not a developer enough to be excited about searching .h files for method names. I need to search restaurant names in unstructured text...
And Spring and Scopeware, for all their promise, have even clunkier fat-client UI's than X1 and Bloomba. I don't need any more User Interfaces in my life!
Hi Adam,
(1) "Its search results are not complete"
There is indeed a cutoff. Only the top x hits are returned. This is under the user's control though. On the other hand, what good will one million hits do you?
(2) "random order"
This is in the eyes of the beholder :) On the other hand, you can sort any result in any way you want.
(3) slow enough
Hmmm... perhaps... YMMV. On the other hand, with over half-a-million messages in my instance running on a humble laptop, it's fast enough.
(4) "simple query syntax"
"Simple query syntax"? Is that an oxymoron? :) In the meantime, the full Lucene syntax is supported. On the other hand:
"If you like regular expressions, boolean searches and SQL queries, this it not for you. If you thrive in complexity, just stay away. The point here is to make complex thing simple (and to keep simple thing simple). Not the other way around."
-- Unknown
http://zoe.nu/itstories/story.php?data=stories&num=16&sec=1
"I can have it all"
No, you can't. At least not what you seem to be implying by bringing Google into the mix. Something has to give one way or another :)
Posted by: Zoe | March 07, 2004 at 07:08 AM
Mr. Zoë awakes -- cool to see you around! :-)
As for user-configurable, I hadn't noticed, but that's understandable. I don't live with it enough to have discovered that feature. However, returning "1M results" is *exactly* what I need. As an example, I'd like my own results for searching FoRK.mbox to rival what I get from Google's xent.com crawl of the same bits.
That said, the key is ranking: only the first 5-7 search results are really worthwhile to most users. [That said, there is a school of thought that users may be more persistent in searching their own email, since they may be "sure I've seen it before, d*mnit!"]
So one of the options facing us is just hacking Zoë to experiment with adding a ranker, but the license makes us a bit wary -- not that we've found a commercial revenue model that makes sense for Fishers yet! :-)
Posted by: Rohit | March 07, 2004 at 12:33 PM
Hi Rohit,
Mr. Zoë awakes -- cool to see you around! :-)
Busy dealing with some pesky Mexican revolutionary ;)
As for user-configurable, I hadn't noticed, but that's understandable. I don't live with it enough to have discovered that feature. However, returning "1M results" is *exactly* what I need.
Hmmmm... how would that be useful in any practical sense?
As an example, I'd like my own results for searching FoRK.mbox to rival what I get from Google's xent.com crawl of the same bits.
I think that continuously referring to Google is misleading, as one of the fundamental flaws with Google & Co. is that they are "bottomless" so to speak. There are also things that one can do in a computer farm that are not realistic to envision on the client side. And vice versa.
That said, the key is ranking: only the first 5-7 search results are really worthwhile to most users. [That said, there is a school of thought that users may be more persistent in searching their own email, since they may be "sure I've seen it before, d*mnit!"]
Yes, ranking is important... but relative... keep in mind there is no "one and only one true, universal ranking"... PageRank, MessageRank or PeopleRank are not the definitive answer no matter how you look at it.
So one of the options facing us is just hacking Zoë to experiment with adding a ranker,
For what you have in mind, the answer may be in the SZLink class :)
but the license makes us a bit wary
Drop me a line. I'm sure we could work out something :)
-- not that we've found a commercial revenue model that makes sense for Fishers yet! :-)
Cheers,
Z.
Posted by: Zoe | March 07, 2004 at 02:21 PM
Many of the points you made are subsumed in this one:
"think that continuously referring to Google is misleading as one of the fundamental flaw with Google & Co. is that they are "bottomless" so to speak. There is also things that one can do in a computer farm that are not realistic to envision on the client side. And vis-versa."
I disagree. First, on the technical side: client-side PC power just continues to increase inexorably to ridiculous proportions, while the speed of human reading is only constant -- I think one back-of-the-envelope measure is that if the fast speed-reader read continuously for a century, that still wouldn't even be 3GB of English text (uncompressed!). Processor, memory, and spare disk space will soon handily eclipse the powers of the earliest-generation search engines -- decade-old technology. (Heck, anyone remember Brian Pinkerton's very first WebCrawler, which sold to AOL for the then-unimaginable $1M; prototyped on NeXTstep using Digital Librarian?)
Second, though, I believe that my personal archive *is* "bottomless". [Not to mention that some of it may be topless, too :-] Not only in the minimal sense that I already have over 1M files and emails on my current laptop (literally true!) but also that I'd like to keep a copy of everything I've ever read, subscribe to all those RSS feeds, and thus keep copies of everything I'm *likely* to read -- my own offline Internet (keyword: rion).
This brings us back to ranking -- you're right, literally paging through 1M results is useless. And there is no one perfect ranking algorithm. "Goodness" is a teleological conundrum: is what-is-good what you are likely-to-read? Is what you are likely-to-read determined by what you have-already-read? It does still fall to a particular auteur to propose a hypothesis for ranking, and for users -- initially, packrats and professionals (say, journalists) -- to dispose...
Good luck with the Zappata revolution!
(see, at least Google can find this page, now that I used the magic word! :-)
Posted by: Rohit Khare | March 08, 2004 at 10:18 AM
I disagree. First, on the technical side: client-side PC power just continues to increase inexorably to ridiculous proportions, while the speed of human reading is only constant -- I think one back-of-the-envelope measure is that if the fast speed-reader read continuously for a century, that still wouldn't even be 3GB of English text (uncompressed!). Processor, memory, and spare disk space will soon handily eclipse the powers of the earliest-generation search engines -- decade-old technology. (Heck, anyone remember Brian Pinkerton's very first WebCrawler, which sold to AOL for the then-unimaginable $1M; prototyped on NeXTstep using Digital Librarian?)
In principle, I agree. In fact this is one of the tenets of ZOE, ZAPPATA & Co.: "The Personal Server".
"Take a closer look at your latest PC (that use to stand
for Personal Computer). It has hundreds of megabytes of memory.
Uncountable gigabytes of disk space. Lightning fast processor. Fast
internet connection. And what does it do, sitting alone by night? It
looks for alien life forms..."
-- Unknown, Practically Speaking
Second, though, I believe that my personal archive *is* "bottomless". [Not to mention that some of it may be topless, too :-]
I disagree. Your data is not bottomless. While significant, this data has nonetheless been filtered by you, Rohit, one way or another. That is not so for Google: Google & Co. aspire to universality and are therefore indiscriminate. And in the process a lot of context gets lost.
Not only in the minimal sense that I already have over 1M files and emails on my current laptop (literally true!) but also that I'd like to keep a copy of everything I've ever read, subscribe to all those RSS feeds, and thus keep copies of everything I'm *likely* to read -- my own offline Internet (keyword: rion).
Aha! Here we go: this is the fundamental difference between Google and, gasp, ZAPPATA for example. In principle, they do the same thing: collecting and finding stuff. In practice, ZAPPATA has the added benefit of, er, "social filtering". This makes a huge difference in the quality of your data and therefore your search ranking.
This brings us back to ranking -- you're right, literally paging through 1M results is useless. And there is no one perfect ranking algorithm. "Goodness" is a teleological conundrum: is what-is-good what you are likely-to-read? Is what you are likely-to-read determined by what you have-already-read? It does still fall to a particular auteur to propose a hypothesis for ranking, and for users -- initially, packrats and professionals (say, journalists) -- to dispose...
Yes, ranking is very important. But the quality of your data is fundamental also.
Good luck with the Zappata revolution!
Thanks :)
(see, at least Google can find this page, now that I used the magic word! :-)
Perhaps. But how helpful would that be if it's on the 5,345,654th page?
Cheers,
Z.
Posted by: Zoe | March 09, 2004 at 01:55 AM
Hi from a non-techie. (My you all sound SMART. I'm just a lowly writer in New Jersey.) I like the Cat in the Hat hat, btw. Found you via Google because we both read the book "Linked" and I was charmed by the Intertwingled title. I've just downloaded an X1 trial.... it's 23 percent through my files now, so we'll see how well it works when it's done.
Given your interest in links and intertwingling, you may appreciate reading about my misadventure with Technorati today.
Posted by: Deb | March 09, 2004 at 09:36 PM
Ok, shameless self-promotion. Check out FILEhand Search. It's modeled after Google in that it returns results sorted by relevance, and you can search using phrases, and AND, OR, and NOT booleans.
Good news/bad news: bad news: no email support -- yet. It's coming.
good news: the original intent was to support file searching, especially PDF, office, MP3 tags, etc. If you are looking for some information in, say, one of 15,000 PDF files, FILEhand Search will sort the results by relevance and show a scrollable text extract of what you were looking for. You likely wouldn't even have to open a PDF reader. And, it costs $39.
OK, so I'm the co-founder of Filehand. Did I do a bad thing by posting here? I don't know myself. But, after we support email in a few weeks, I think you'll find FILEhand Search a closer analogy to Google for the desktop.
Posted by: Elliot | March 31, 2004 at 04:45 PM
What would a Fisher look like if you deconstructed the "search engine", shredded it apart and flipped it over so to speak, solving search in terms of coordination instead of indexing? Treat indexing as an active filtering problem, so that a topic such as /people/adam/terms/intertwingled/ maintains a collection of tuples referring to the emails sent by Adam that contain the term intertwingled. Perhaps the UI could map basic search terms into route configurations to return the more ephemeral structure that a query result would entail. Does this make sense?
In general can't help but be fascinated by the idea of programming in the small much along the lines of Clay's piece on Situated Software http://www.shirky.com/writings/situated_software.html . A Fisher seems to be an interesting problem for this sort of thinking.
Posted by: Robert | April 02, 2004 at 03:25 AM
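One possible reading of Robert's suggestion, as a small Python sketch: each incoming message is routed at arrival time into topic paths, so a query becomes a path lookup. The TopicIndex class, its method names, and the extra /terms/<term>/ path are hypothetical; only the /people/adam/terms/intertwingled/ path shape comes from the comment above, and the "route configuration" idea is reduced here to a plain dictionary.

```python
from collections import defaultdict


class TopicIndex:
    """Hypothetical topic router: messages are filed under paths as they arrive."""

    def __init__(self):
        # Maps a topic path to the set of message ids filed under it.
        self.topics = defaultdict(set)

    def publish(self, msg_id, sender, text):
        """Route one incoming message to every topic path it belongs to."""
        sender = sender.lower()
        for term in set(text.lower().split()):
            self.topics[f"/people/{sender}/terms/{term}/"].add(msg_id)
            self.topics[f"/terms/{term}/"].add(msg_id)

    def lookup(self, path):
        """Return the message ids currently filed under a topic path."""
        return sorted(self.topics.get(path, ()))
```

After index.publish(1, "Adam", "everything is intertwingled"), a lookup of "/people/adam/terms/intertwingled/" returns [1]. Whether this differs in substance from an ordinary inverted index keyed on (sender, term) is exactly the open question in the comment.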
Left unmentioned here is Nelson - see www.caelo.com. It is way, way better than a "Google for Email": it searches text as well as any other IR-based tool, but it also takes into account that:
a) People use Email to manage tasks
b) People use Email to store documents
c) People work on a project basis
d) People look up email by people, and by time period
Nelson handles all this but doesn't get in the way of your Outlook workflow.
BTW, jwz "invented" nothing. This is a very old topic.
Posted by: Gary | April 17, 2004 at 09:41 PM