In April 2004, in a description of the (now-part-of-Looksmart) Furl service, John Battelle defined the Personal Web:
Furl saves the entire web page you've "furled", not just the URL, which prevents link rot, on the one hand, and creates what I'll call a "PersonalWeb," on the other.
Now, having your own PersonalWeb is a very cool thing. Every page you care about is now saved forever, and is searchable. How I wish I had Furl while I was researching my book for the past year. This application was inconceivable before the cost of storage and bandwidth began to fall toward zero.
But wait...there's more. You can share your PersonalWeb with others. And Mike just added a recommendation engine, so you can see links the service thinks will be interesting to you, based on what you've already Furl'd. Now, let's play this out. Imagine Furl on, oh, Yahoo, for example. Or Google. You now have a massively scaled application where millions of people are creating their own personal versions of the web, and then sharing them with each other, driving massively statistically significant recommendations, and...some pretty damn useful metadata that can be fed into search engine algorithms, resulting in...yup, far better search (and...far better SFO (Search Find Obtain) opportunities).
Speaking of SFO, imagine the business model. (Mike has, trust me.) If you have a system that has stored millions of people's PersonalWebs, webs they have literally voted for by *taking action* and *saving* or even *annotating*, then it's not such a trick to apply some contextual advertising mojo to the whole lot. After all, Web 2.0 is built on the premise that taking action - voting, in effect - can create scaled value (the best known expression of a scaled "voting system" is Page Rank- a link is a vote.)
(Aside #1: That last paragraph reminds me of NudeCybot's post Blogdex: a modern oracle or the new Pagerank? (inspired by Biz Stone's The Wisdom of Blogs): "There is a certain collective intelligence when you have the power of numbers, diversity of opinion, and freedom of both medium and content... This is strikingly similar to the process of biological evolution whose effectiveness in what I would describe as life itself solving the problem of staying alive is related to population size, phenotypic variation and its underlying genetic variation, and availability of diverse environments with subtle variations.")
In September 2004 John revisited Personal Webs in his description of a new Ask service:
MyJeeves allows you to save results, annotate them, and then manage them in your own personal folders. Those results (and the annotation) are then searchable (as they are with A9)... Once you do, you get unlimited storage of saved results, and... pay attention here... your search history... It's like creating your own web index... the integration of search with perfect copies of what I've seen on the web...
(Aside #2: That article also talks about the imminent release of the (now-part-of-Ask) Tukaroo product integrating web search with my hard drive, which continues to push forward the evolution of Fisher as a product category.)
John's article then paints a compelling picture where a world of connected Personal Webs -- with the derivatives of information in the Personal Web also being thrown into the Information Soup that is the Global Web -- creates a virtuous cycle that keeps generating and refining information:
None of these features [Personal Webs and Fishers being two of them] are big enough to warrant a Google Moment like we had in 2000-2001. However, they all point to an incredibly robust future - and by the way, a future in which personal publishing is very much integrated into search, and vice versa. Just a thought, but once a critical mass of folks are saving searches, search results, annotations, and the like, sure as shit they'll want to share them, publish them, and cite them (and sure as shit, engines will want to crawl em for relevance mojo). Just watch as search, blogging, and RSS start to feed off one another.
Let's put aside copyright issues -- I'm assuming the New York Times, for example, has as much of a problem with me saving a personal copy of every (copyrighted) page I've read on nytimes.com as the RIAA has with my ripping mp3's from my CD collection (no DRM?? for shame!) -- and stick to the question of whether my Personal Web is big. How much work would a Fisher that indexes my Hard Drive and my Personal slice of the public Web have to do?
Back-of-the-envelope calculation #1: PersonalWeb.
An approximation of the size of my Personal Web is a function of how many pages I've read, how long I've been reading, and what the average page size is. That is,
Personal Web Size = Avg # Pages/Day x # Days x Avg Page Size
I've been on the Web for almost 12 years. Assume a deliberately high estimate of the average number of web pages I've visited per day, say 100. I don't know what the average web page size is today, but it was around 60k a few years ago. Let's call it 100k now to account for the bigger images that wider broadband adoption has made commonplace, knowing full well that, like the 100 pages a day, it's a high estimate. That puts an upper bound on the size of the Personal Web I've personally seen at 12 x 365.25 x 100 x 100k = 4383 days x 100 pages/day x 100,000 bytes/page = 43,830,000,000 bytes, or just under 44 Gigabytes.
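For anyone who wants to poke at the assumptions, here's a quick Python sketch of that arithmetic; every input is just one of the guesses above, not a measured value:

# Back-of-the-envelope #1: upper bound on the size of my Personal Web.
YEARS_ON_WEB = 12
DAYS_PER_YEAR = 365.25
PAGES_PER_DAY = 100            # deliberately high guess
PAGE_SIZE_BYTES = 100_000      # ~100k per page, also a high guess

days = YEARS_ON_WEB * DAYS_PER_YEAR                          # 4383 days
personal_web_bytes = days * PAGES_PER_DAY * PAGE_SIZE_BYTES
print(f"Personal Web upper bound: {personal_web_bytes / 1e9:.1f} GB")  # ~43.8 GB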
Wow. I already have more than that much space available today on my hard drive (and/or in my 100 Gigabyte email account). So if I had a Personal Proxy that seamlessly stashed a copy of every Web page I've ever seen, I could easily store it all today, and then I'd just have to figure out how to index and search it.
Back-of-the-envelope calculation #2: PersonalWeb++.
Assume we live in The World Of Tomorrow, where hard drives have many Terabytes of storage. Even though I'm regularly slurping down huge amounts of content via BitTorrent and eMule -- gotta love decentralized filesharing -- I still have plenty of space (and CPU power) not only to keep on my hard drive a copy of my entire Personal Web (it grows at 3.65 Gigabytes a year), but also to run my own Personal Spider that regularly crawls, on my behalf, the entire contents of any site I've ever visited (including but not limited to all my favorite blogs and RSS feeds), without ever clobbering any of the content it has ever fetched for me, so that nothing ever gets lost in the ether of, say, Archive, where I can never find it again. How many websites have I visited over the years? Let's be conservative and say I visit an average of 50 websites a day, year after year, but how many of those are unique? Let's conservatively say I discover on average 10 new web sites a day -- and I know that number is high if I average it over decades. How big is the average website? Let's be really conservative and say that it's 1000 pages. Using our earlier assumption of 100k per page, I have 10 new sites/day x 1000 pages/site x 100k bytes/page, or 1 Gigabyte per day on average of new material to add to the store of all information around the periphery of my Personal Web. Over 12 years (or 4383 days) the upper bound on the amount of storage I would have needed to save my PersonalWeb++ is therefore less than 4.4 Terabytes. Uncompressed. Which seems small if I have a 100 Terabyte hard drive at my disposal.
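Here's the same guesswork as a small Python sketch, in case you'd rather swap in your own numbers:

# Back-of-the-envelope #2: upper bound on PersonalWeb++ (my Personal Web plus
# everything a hypothetical Personal Spider would fetch from sites I've visited).
NEW_SITES_PER_DAY = 10         # new sites discovered per day (a high guess)
PAGES_PER_SITE = 1_000         # average site size in pages (a guess)
PAGE_SIZE_BYTES = 100_000      # ~100k per page, as before
DAYS = 12 * 365.25             # 4383 days

bytes_per_day = NEW_SITES_PER_DAY * PAGES_PER_SITE * PAGE_SIZE_BYTES   # ~1 GB/day
total_bytes = bytes_per_day * DAYS
print(f"PersonalWeb++ upper bound: {total_bytes / 1e12:.1f} TB")       # ~4.4 TB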
So maybe a Fisher that indexes and offers up PersonalWeb++ is a big idea after all. At least if storage is the metric by which we measure the bigness of an idea. Which gets me wondering...
Back-of-the-envelope calculation #3: How big is the Web today?
Four years ago, BrightPlanet did a deep web study, and determined that "Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web. The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web. The deep Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web. More than an estimated 100,000 deep Web sites presently exist. Sixty of the largest deep Web sites collectively contain about 750 terabytes of information – sufficient by themselves to exceed the size of the surface Web by 40 times."
Fast forward some. 18 months ago a blogger wrote, "Google has indexed over 3 billion. I thought the common wisdom was that there are over 20 billion pages."
How about today? Well, Google says it currently indexes 4.3 billion web pages. Let's assume for fun that they only have 10% of the actual existing Global Web in their wonderful little paws. 43 billion web pages times an average of 100k per page is 4.3 quadrillion bytes (that's 4,300 terabytes, or 4.3 petabytes)... sounds like a lot.
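The same wild guess, as a Python sketch (the 10% coverage figure is purely an assumption, as above):

# Back-of-the-envelope #3: a guess at the size of the Global Web.
GOOGLE_INDEXED_PAGES = 4.3e9   # what Google says it indexes today
COVERAGE = 0.10                # assume that's only 10% of what's out there
PAGE_SIZE_BYTES = 100_000      # ~100k per page, as before

total_pages = GOOGLE_INDEXED_PAGES / COVERAGE             # 43 billion pages
total_bytes = total_pages * PAGE_SIZE_BYTES
print(f"Global Web guess: {total_bytes / 1e15:.1f} PB")   # ~4.3 petabytes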
We've certainly come a long way since 1993, and we still have a long way to go before we get to yottabytes.
All this talk of big numbers reminds me of the excellent ACM Queue article by Peter Lyman and Hal Varian, How much storage is enough?, with oodles of nice soundbites (sound bytes?):
In 1999, the world produced about 1.5 exabytes of storable content (10^18 bytes). This is 1.5 billion gigabytes, and is equivalent to about 250 megabytes for every man, woman, and child on earth. Printed documents of all kinds make up only .003 percent of the total...
The tough part is digital content, though that's the most important component. According to our estimates, more than 90 percent of information content is born digital these days. Conversion to ASCII, MP3, MPEG, and other compression technologies dramatically reduces storage requirements by one to two orders of magnitude.
If all printed material published in the world each year were expressed in ASCII, it could be stored in less than 5 terabytes...
We expect that digital communications will be systematically archived in the near future and thus will contribute to the demand for storage. In 1999 we estimated e-mail to be about 12 petabytes per year, Usenet about 73 terabytes, and the static HTML Web about 20 terabytes. Many Web pages are generated on-the-fly from data in databases, so the total size of the "deep Web" is considerably larger.
They updated the survey in 2003 with the wonderful site How Much Information?:
The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.
Ninety-three percent of the information produced each year is stored in digital form. Hard drives in stand-alone PCs account for 55% of total storage shipped each year.
Over 80 billion photographs are taken every year, which would take over 400 petabytes to store, more than 80 million times the storage requirements for text.
One terabyte, the smallest practical measure for our project, is a million megabytes, which is equivalent to the textual content of a million books. An exabyte, which is what we use to report the final results, is a billion gigabytes.
There are plenty more good nuggets in there -- see the sound bytes, charts, and summary.

But reading that material got me wondering. "If all printed material published in the world each year were expressed in ASCII, it could be stored in less than 5 terabytes." Can you imagine a day in the future when everything published in an entire year could be shipped to anyone who wanted it? (Again, ignore copyrights.) In less than 100 years, will we have a hard drive small enough to carry around yet big enough to store everything anyone has ever published? That's when Personal Webs -- not just my annotations of what I've read and written, but the annotations of people I trust -- can guide me to the probable answer I'm seeking by providing tips to search engines about the kind of information I like.
I'll leave you with a final thought. How much stuff can I read in a lifetime? (Not just the Web, but books, and magazines, and the backs of cereal boxes... how many bytes will that be?)
Back-of-the-envelope calculation #4: How much do I personally read?
Assume I'm a whiz who speed reads, so I can read a page of text in a minute, and that a page of text is 250 words. I'm college-fed, so assume that an average word in my kind of reading is 8 characters. Then my average page is 2000 bytes of raw ASCII, or 2K. How many hours a day do I spend reading? Assume I'm an egghead who can multitask, so I average a solid 6 hours of reading per day, whether at work or at play or in meetings, or what have you. 6 x 60 = 360 minutes of reading per day. That's 720k per day, which admittedly seems like an upper bound. How long will I live? Well, I have an unhealthy lifestyle, but living long runs in my family -- I have four grandparents who all passed 80, and three who passed 90 -- so assume I live 90 years, and chop off 10 because I was a slacker when I was young. 80 years of reading times 365.25 days per year is 29,220 days of reading before I kick the bucket. And that's if I'm lucky. But we're interested in establishing upper bounds here, so it'll do. 29,220 days times 720k per day is 21,038,400 kilobytes that I could read in a lifetime. In other words, with my wildly conservative estimates an upper bound on the amount of text I could possibly read in a lifetime is 21 Gigabytes.
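And one last Python sketch of calculation #4, using the same generous guesses as the paragraph above:

# Back-of-the-envelope #4: upper bound on how much text I can read in a lifetime.
WORDS_PER_PAGE = 250
BYTES_PER_WORD = 8             # longish words, raw ASCII
PAGES_PER_MINUTE = 1           # speed-reader assumption
READING_HOURS_PER_DAY = 6
READING_YEARS = 80

bytes_per_page = WORDS_PER_PAGE * BYTES_PER_WORD                                 # 2,000 bytes (~2K)
bytes_per_day = bytes_per_page * PAGES_PER_MINUTE * 60 * READING_HOURS_PER_DAY   # ~720K/day
lifetime_bytes = bytes_per_day * READING_YEARS * 365.25                          # ~29,220 reading days
print(f"Lifetime reading upper bound: {lifetime_bytes / 1e9:.0f} GB")            # ~21 GB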
Personal Webs with recommender systems that take into account what I like to read and write, and what the people I trust like to read and write, are the only way to make sure those 21 Gigabytes count.