In April 2004, in a description of the (now-part-of-LookSmart) Furl service, John Battelle defined the Personal Web:
Furl saves the entire web page you've "furled", not just the URL, which prevents link rot, on the one hand, and creates what I'll call a "PersonalWeb," on the other.
Now, having your own PersonalWeb is a very cool thing. Every page you care about is now saved forever, and is searchable. How I wish I had Furl while I was researching my book for the past year. This application was inconceivable before the cost of storage and bandwidth began to fall toward zero.
But wait...there's more. You can share your PersonalWeb with others. And Mike just added a recommendation engine, so you can see links the service thinks will be interesting to you, based on what you've already Furl'd. Now, let's play this out. Imagine Furl on, oh, Yahoo, for example. Or Google. You now have a massively scaled application where millions of people are creating their own personal versions of the web, and then sharing them with each other, driving massively statistically significant recommendations, and...some pretty damn useful metadata that can be fed into search engine algorithms, resulting in...yup, far better search (and...far better SFO (Search Find Obtain) opportunities).
Speaking of SFO, imagine the business model. (Mike has, trust me.) If you have a system that has stored millions of people's PersonalWebs, webs they have literally voted for by *taking action* and *saving* or even *annotating*, then it's not such a trick to apply some contextual advertising mojo to the whole lot. After all, Web 2.0 is built on the premise that taking action - voting, in effect - can create scaled value (the best known expression of a scaled "voting system" is PageRank - a link is a vote).
(Aside #1: That last paragraph reminds me of NudeCybot's post Blogdex: a modern oracle or the new Pagerank? (inspired by Biz Stone's The Wisdom of Blogs): "There is a certain collective intelligence when you have the power of numbers, diversity of opinion, and freedom of both medium and content... This is strikingly similar to the process of biological evolution whose effectiveness in what I would describe as life itself solving the problem of staying alive is related to population size, phenotypic variation and its underlying genetic variation, and availability of diverse environments with subtle variations.")
In September 2004 John revisited Personal Webs in his description of a new Ask service:
MyJeeves allows you to save results, annotate them, and then manage them in your own personal folders. Those results (and the annotation) are then searchable (as they are with A9)... Once you do, you get unlimited storage of saved results, and... pay attention here... your search history... It's like creating your own web index... the integration of search with perfect copies of what I've seen on the web...
(Aside #2: That article also talks about the imminent release of the (now-part-of-Ask) Tukaroo product integrating web search with my hard drive, which continues to push forward the evolution of Fisher as a product category.)
John's article then paints a compelling picture where a world of connected Personal Webs -- with the derivatives of information in the Personal Web also being thrown into the Information Soup that is the Global Web -- creates a virtuous cycle that keeps generating and refining information:
None of these features [Personal Webs and Fishers being two of them] are big enough to warrant a Google Moment like we had in 2000-2001. However, they all point to an incredibly robust future - and by the way, a future in which personal publishing is very much integrated into search, and vice versa. Just a thought, but once a critical mass of folks are saving searches, search results, annotations, and the like, sure as shit they'll want to share them, publish them, and cite them (and sure as shit, engines will want to crawl em for relevance mojo). Just watch as search, blogging, and RSS start to feed off one another.
Let's put aside copyright issues -- I'm assuming the New York Times, for example, has as much of a problem with my saving a personal copy of every (copyrighted) page I've read on nytimes.com as the RIAA has with my ripping mp3's from my CD collection (no DRM?? for shame!) -- and stick to the question of whether my Personal Web is big. How much work would it be to build a Fisher that indexes both my Hard Drive and my Personal slice of the public Web?
Back-of-the-envelope calculation #1: PersonalWeb.
An approximation of the size of my Personal Web is a function of how many pages I've read, how long I've been reading, and what the average page size is. That is,
Personal Web Size = Avg # Pages/Day x # Days x Avg Page Size
I've been on the Web for almost 12 years. Assume a high estimate of the average number of web pages I've visited per day, say 100. I don't know what the average web page size is today, but it used to be about 60k a few years ago. Let's call it 100k now to account for the bigger images that wider broadband adoption has made practical, knowing full well that, like the 100 pages a day, the estimate is high. That makes the upper bound on the size of the Personal Web I've personally seen 12 years x 365.25 days/year x 100 pages/day x 100,000 bytes/page = 43,830,000,000 bytes, or less than 44 Gigabytes.
Wow. I already have more than that much space available today on my hard drive (and/or in my 100 Gigabyte email account). So if I had a Personal Proxy that seamlessly stashed a copy of every Web page I've ever seen, I could easily store it all today, and then I'd just have to focus on how to index and search it.
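For anyone who wants to fiddle with the assumptions, here's calculation #1 as a few lines of Python, using exactly the guesses above:

```python
# Back-of-the-envelope #1: an upper bound on the size of my Personal Web.
days = 12 * 365.25        # ~12 years on the Web
pages_per_day = 100       # a deliberately high guess
bytes_per_page = 100_000  # ~100k per page, also a high guess
total_bytes = days * pages_per_day * bytes_per_page
print(f"{total_bytes / 1e9:.1f} GB")  # -> 43.8 GB
```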
Back-of-the-envelope calculation #2: PersonalWeb++.
Assume we live in The World Of Tomorrow, where hard drives have many Terabytes of storage. Even though I'm regularly slurping down huge amounts of content via BitTorrent and eMule -- gotta love decentralized filesharing -- I still have plenty of space (and CPU power) not only to keep a copy of my entire Personal Web on my hard drive (it grows at 3.65 Gigabytes a year), but also to run my own Personal Spider that regularly crawls, on my behalf, the entire contents of any site I've ever visited (including but not limited to all my favorite blogs and RSS feeds), without ever clobbering any of the content it has ever fetched for me, so that nothing ever gets lost in the ether of, say, the Internet Archive, where I can never find it again.

How many websites have I visited over the years? Let's be conservative and say I visit an average of 50 websites a day, year after year. But how many of those are unique? Let's conservatively say I discover on average 10 new web sites a day -- and I know that number is high if I average it over decades. How big is the average website? Let's be really conservative and say it's 1000 pages. Using our earlier assumption of 100k per page, that's 10 new sites/day x 1000 pages/site x 100,000 bytes/page, or 1 Gigabyte per day on average of new material to add to the store of all information around the periphery of my Personal Web. Over 12 years (or 4383 days), the upper bound on the amount of storage I would have needed to save my PersonalWeb++ is therefore less than 4.4 Terabytes. Uncompressed. Which seems small if I have a 100 Terabyte hard drive at my disposal.
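Again in Python, with the same guesses:

```python
# Back-of-the-envelope #2: PersonalWeb++ (spidering every site I visit).
new_sites_per_day = 10    # newly discovered sites, a high guess
pages_per_site = 1_000    # average site size, a very high guess
bytes_per_page = 100_000
days = 12 * 365.25
total_bytes = days * new_sites_per_day * pages_per_site * bytes_per_page
print(f"{total_bytes / 1e12:.1f} TB")  # -> 4.4 TB, uncompressed
```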
So maybe a Fisher that indexes and offers up PersonalWeb++ is a big idea after all. At least if storage is the metric by which we measure the bigness of an idea. Which gets me wondering...
Back-of-the-envelope calculation #3: How big is the Web today?
Four years ago, BrightPlanet did a deep web study, and determined that "Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web. The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web. The deep Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web. More than an estimated 100,000 deep Web sites presently exist. Sixty of the largest deep Web sites collectively contain about 750 terabytes of information – sufficient by themselves to exceed the size of the surface Web by 40 times."
Fast forward some. 18 months ago a blogger wrote, "Google has indexed over 3 billion. I thought the common wisdom was that there are over 20 billion pages."
How about today? Well, Google says it currently indexes 4.3 billion web pages. Let's assume for fun that they only have 10% of the actual existing Global Web in their wonderful little paws. 43 billion web pages times an average of 100k per page is 4.3 quadrillion bytes -- that's 4,300 terabytes, or 4.3 petabytes... sounds like a lot.
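The same arithmetic in Python:

```python
# Back-of-the-envelope #3: the Global Web, if Google's index is 10% of it.
indexed_pages = 4.3e9
total_pages = indexed_pages / 0.10     # -> 43 billion pages
total_bytes = total_pages * 100_000    # 100k per page, as before
print(f"{total_bytes / 1e15:.1f} PB")  # -> 4.3 petabytes
```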
We've certainly come a long way since 1993, and we still have a long way to go before we get to yottabytes.
All this talk of big numbers reminds me of the excellent ACM Queue article by Peter Lyman and Hal Varian, How much storage is enough?, with oodles of nice soundbites (sound bytes?):
In 1999, the world produced about 1.5 exabytes of storable content (10^18 bytes). This is 1.5 billion gigabytes, and is equivalent to about 250 megabytes for every man, woman, and child on earth. Printed documents of all kinds make up only .003 percent of the total...
The tough part is digital content, though that's the most important component. According to our estimates, more than 90 percent of information content is born digital these days. Conversion to ASCII, MP3, MPEG, and other compression technologies dramatically reduces storage requirements by one to two orders of magnitude.
If all printed material published in the world each year were expressed in ASCII, it could be stored in less than 5 terabytes...
We expect that digital communications will be systematically archived in the near future and thus will contribute to the demand for storage. In 1999 we estimated e-mail to be about 12 petabytes per year, Usenet about 73 terabytes, and the static HTML Web about 20 terabytes. Many Web pages are generated on-the-fly from data in databases, so the total size of the "deep Web" is considerably larger.
They updated the survey in 2003 with the wonderful site How Much Information?:
The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.
Ninety-three percent of the information produced each year is stored in digital form. Hard drives in stand-alone PCs account for 55% of total storage shipped each year.
Over 80 billion photographs are taken every year, which would take over 400 petabytes to store, more than 80 million times the storage requirements for text.
One terabyte, the smallest practical measure for our project, is a million megabytes, which is equivalent to the textual content of a million books. An exabyte, which is what we use to report the final results, is a billion gigabytes.
There are plenty more good nuggets in there -- see the sound bytes, charts, and summary.
But reading that material got me wondering. "If all printed material published in the world each year were expressed in ASCII, it could be stored in less than 5 terabytes." Can you imagine the day in the future when everything published in an entire year could be shipped to anyone who wanted it? (Again, ignore copyrights.) In less than 100 years, will we have a hard drive small enough to carry around yet big enough to store everything anyone has ever published? That's when Personal Webs -- not just the annotations of what I've read and written, but the annotations of people I trust -- can guide me to the probable answer I'm seeking by providing tips to search engines about the kind of information I like.
I'll leave you with a final thought. How much stuff can I read in a lifetime? (Not just the web, but books, and magazines, and the backs of cereal boxes... how many bytes will that be?)
Back-of-the-envelope calculation #4: How much do I personally read?
Assume I'm a whiz who speed-reads, so I can read a page of text in a minute, and that a page of text is 250 words. I'm college-fed, so assume that an average word in my kind of reading is 8 characters. Then my average page is 2000 bytes of raw ASCII, or 2K. How many hours a day do I spend reading? Assume I'm an egghead who can multitask, so I average a solid 6 hours of reading per day, whether at work or at play or in meetings, or what have you. 6 x 60 = 360 minutes of reading per day. That's 720k per day, which admittedly seems like an upper bound.

How long will I live? Well, I have an unhealthy lifestyle, but living long runs in my family -- I have four grandparents who all passed 80, and three who passed 90 -- so assume I live 90 years, and chop off 10 because I was a slacker when I was young. 80 years of reading times 365.25 days per year is 29,220 days of reading before I kick the bucket. And that's if I'm lucky. But we're interested in establishing upper bounds here, so it'll do. 29,220 days times 720k per day is 21,038,400 kilobytes that I could read in a lifetime. In other words, with my wildly conservative estimates, an upper bound on the amount of text I could possibly read in a lifetime is 21 Gigabytes.
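One last snippet, with the same numbers as above:

```python
# Back-of-the-envelope #4: an upper bound on lifetime reading.
bytes_per_page = 250 * 8   # 250 words x 8 characters = 2K of ASCII
pages_per_day = 6 * 60     # 6 hours a day at a page a minute
days = 80 * 365.25         # 80 reading years
total_bytes = days * pages_per_day * bytes_per_page
print(f"{total_bytes / 1e9:.1f} GB")  # -> 21.0 GB in a lifetime
```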
Personal Webs with recommender systems that take into account what I like to read and write, and what the people I trust like to read and write, are the only way to make sure those 21 Gigabytes count.
one question: do they really need to count for you to be happy? i've noticed recently (maybe it's not really new) a branch of blogging devoted to self-optimisation -- think brain overclocking, if you will -- and then i think about pekka himanen writing in the hacker ethic of hackers breaking out of the bounds of the iron cage, and i say to myself: they're not breaking out, they're adding extra bars from the inside.
Posted by: alek | October 03, 2004 at 12:04 AM
I enjoyed your back-of-the-envelope calculations. Aside from reassuring me that the businesses of storing, transmitting, sorting, and searching information will probably continue to grow, I wondered: when information is dirt cheap to store, transmit, and share, how do you control it?
You just got the ball rolling for me on what became far too long a ramble to post here. If you're interested you can read it here:
http://www.nudecybot.net/2004/10/infreemation-revolution.html
Posted by: NudeCybot | October 05, 2004 at 10:20 PM
Alek, point well taken. They don't really need to count for me to be happy. However, I am always confronted with a thousand or more things I want to read, and I'm never going to be able to read them all. And so I continue to search for a means by which to prioritize what would be best for me to read next.
NudeCybot, your post is quite thought-provoking. I already know of a company that was formed for the express purpose of filing patents that could then be sold or licensed.
Perhaps to counter that we need to build a machine that generates all combinations of 0's and 1's up to a googol of bits, and patent them all -- and then show how anything anyone tries to patent from then on maps to one of the bit sequences we've already patented... perhaps the best way to level the playing field is to destroy it.
Posted by: Adam | October 11, 2004 at 12:44 PM
Greg Linden: "If you want to get started building your personal web, take a look at Seruku and Recall Toolbar."
Also, I'm very excited today. Fluffy Bunny (aka Google Desktop) has been a thrilling tool to use... instant gratification...
Posted by: Adam | October 14, 2004 at 12:45 PM
(I just found some notes from Rohit, written just after I wrote this post. I'm including them here because I think they offer a compelling vision...)
Personal Web Platform
Battelle's Web 2.0 Conference is next week. The theme is "the web as platform": the trend of applications moving from stand-alone, with local data, to networked, with shared, remote data. As this happens, the details of one's local operating system become less relevant. All you'll need is a web browser, and perhaps a select few other web-based applications. With less stuff required of local applications, this forms a threat to Microsoft's desktop dominance. My concern is whether another company might replace Microsoft's desktop monopoly with a "web OS" monopoly.
In this web-based world, I'd like to keep all my personal data remotely, so that I can access it equally well from a Linux workstation, an Apple laptop, a Palm phone, and a Windows-based internet-access terminal. Still, I'd like to leverage my local resources. For example, my laptop and handheld should be able to access my data while offline, and my workstation should be able to search it quickly using a local database.

Another big advantage of storing data remotely is that, if my laptop hard drive fails, or I get a new workstation, I don't have to worry about copying all my stuff. I just log into my personal web and, voila, all my stuff is right there.
As implied above, I should be able to search all my data. Apple's Spotlight will search my local private data on a Mac, as will Gnome's Beagle and Longhorn on Windows. But synchronizing my data across these platforms is still a pain, so these don't really solve the problem.

What's needed to make this possible? We already have standards for accessing most remote data. Email can be accessed with IMAP, address books with LDAP, files with WebDAV, chat with Jabber, etc. There are even providers for most of these services, and each has clients for most platforms. So what's missing? A little glue, I think.
We need a standard way to store bookmarks, history, and other personal metadata (like the names of your mail, instant messaging, and WebDAV accounts), and a standard way to intercept personal data so that it can be indexed and searched, either locally or remotely. But who will build the glue?
I think this has to be a cross-platform open-source project. It has too much potential as a chokepoint to be entrusted to a commercial party.

Inspired by ideas around Fisher, much of this could be achieved with a proxy server. It could proxy HTTP, IMAP, POP, LDAP, Jabber, etc., transparently indexing and caching things. One could connect to it as a web server to search and configure things. Applications can contact it through HTTP to get their configuration data. It can expose web services APIs for all this too, so that native applications can be built for search, etc. If we could, e.g., get Mozilla and other desktop applications to look for the daemon on install, and, when it's present, configure themselves through it by default, then all that one should need to do on a new machine is tell the daemon where on the web your personal configuration lives, and you're good to go. With one step, your files, address book, bookmarks, cookies, logins, email configuration, etc. would all be there.
The daemon would mostly be a framework for plugins. For example, search needn't be hard-wired into it; it should be a plugin. Different vendors might provide different personal search applications. Similarly, a spam detector could easily be plugged into the email processing pipeline, etc.
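Rohit's proxy idea is concrete enough to sketch. Below is a toy version in (present-day) Python: an HTTP proxy that stashes a timestamped copy of everything fetched through it and hands each page to a list of plugins, with search, spam filtering, and the rest living behind the plugin interface. Every name in it (PersonalProxy, Indexer, the ~/.personalweb cache directory) is invented for illustration; a real daemon would also need HTTPS CONNECT handling, IMAP/LDAP proxying, and the configuration web service he describes.

```python
import hashlib
import pathlib
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical cache location -- not from any real project.
CACHE_DIR = pathlib.Path.home() / ".personalweb"


class Indexer:
    """Plugin interface: search, spam filtering, etc. would live behind this."""

    def observe(self, url: str, body: bytes) -> None:
        raise NotImplementedError


class SizeLogger(Indexer):
    """Toy stand-in for a real search plugin: just logs what went by."""

    def observe(self, url: str, body: bytes) -> None:
        print(f"indexed {url}: {len(body)} bytes")


PLUGINS = [SizeLogger()]


class PersonalProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        url = self.path  # a proxied browser sends the absolute URL here
        try:
            with urllib.request.urlopen(url) as upstream:
                body = upstream.read()
                status = upstream.status
                ctype = upstream.headers.get("Content-Type", "text/html")
        except Exception as exc:
            self.send_error(502, str(exc))
            return
        # Stash a timestamped copy so earlier fetches are never clobbered.
        CACHE_DIR.mkdir(exist_ok=True)
        name = f"{hashlib.sha256(url.encode()).hexdigest()}-{int(time.time())}"
        (CACHE_DIR / name).write_bytes(body)
        for plugin in PLUGINS:  # hand the page to every plugin
            plugin.observe(url, body)
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Point a browser's HTTP proxy setting at 127.0.0.1:8080 to try it.
    # (HTTPS needs CONNECT tunneling, which this sketch omits.)
    ThreadingHTTPServer(("127.0.0.1", 8080), PersonalProxy).serve_forever()
```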
(Editor's note: sounds like Google Desktop is a great first step toward a Personal Proxy...)
Posted by: Adam | October 27, 2004 at 08:21 PM
John Markoff on the current size of The Google Web:
Posted by: Adam | November 10, 2004 at 06:31 PM
Ten Reasons Why on Personal Web:
Well stated. I also found this nice table on the Size of the Prize from Charles Ferguson's article in the MIT Tech Review, What's Next for Google.
Also, I found Jeremy's post on email and browser URL extraction and search to be particularly interesting in how it relates to one's Personal Web.
Sounds like some kind of wonderful.
Posted by: Adam | December 21, 2004 at 03:52 PM
Hi Adam, I have stumbled upon your idea of the Personal Web. I must introduce you to the open-source personal search engine project MindRetrieve that I have launched. It does something fairly close to what you describe: an HTTP proxy that saves and lets you search everything you've read. I did some math myself and am quite convinced that it is feasible to save a copy of everything we ever read. MindRetrieve starts modest and saves only a trimmed-down version of web pages right now. It is already very handy for bringing back a lot of things I have recently read. It runs on Windows and Linux right now, with a Mac version to follow soon.
Posted by: Wai Yip Tung | February 06, 2005 at 06:43 PM