<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Web site harvesting &#8211; it&#8217;s happening now.</title>
	<atom:link href="http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/feed/" rel="self" type="application/rss+xml" />
	<link>http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/</link>
	<description>What? No lemongrass?</description>
	<lastBuildDate>Tue, 15 Dec 2009 13:04:46 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: nev</title>
		<link>http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/comment-page-1/#comment-4</link>
		<dc:creator>nev</dc:creator>
		<pubDate>Thu, 07 Sep 2006 03:52:04 +0000</pubDate>
		<guid isPermaLink="false">http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/#comment-4</guid>
		<description>Hi G,

This is nothing new. 

What about all those Search Engines indexing the web, tying up bandwidth?

How many of those #$#% things does the world need?

One educational facility I know of is using the Google Search. It indexes about 800,000 pages every time. Imagine the cost of that bandwidth. Not a bad argument for a Google Appliance.

And what of the old Time Machine ( www.archive.org ). They&#039;ve been doing it for years.

Advertisements? Big bandwidth hogs... We actually pay to receive advertising. What a bunch of suckers we are!!! I&#039;d love to see ISPs block ads altogether to preserve our precious bandwidth.

I don&#039;t block Flash for nothing you know...

Really needs the whole population to wise up...

OR get on the wagon ourselves :)

Like archive.org, at least the nla does it for an admirable reason :)

Consider it a history tax, Mr Ripper :)

my 0.02 cents :)</description>
		<content:encoded><![CDATA[<p>Hi G,</p>
<p>This is nothing new. </p>
<p>What about all those Search Engines indexing the web, tying up bandwidth?</p>
<p>How many of those #$#% things does the world need?</p>
<p>One educational facility I know of is using the Google Search. It indexes about 800,000 pages every time. Imagine the cost of that bandwidth. Not a bad argument for a Google Appliance.</p>
<p>And what of the old Time Machine ( <a href="http://www.archive.org" rel="nofollow">http://www.archive.org</a> ). They&#8217;ve been doing it for years.</p>
<p>Advertisements? Big bandwidth hogs&#8230; We actually pay to receive advertising. What a bunch of suckers we are!!! I&#8217;d love to see ISPs block ads altogether to preserve our precious bandwidth.</p>
<p>I don&#8217;t block Flash for nothing you know&#8230;</p>
<p>Really needs the whole population to wise up&#8230;</p>
<p>OR get on the wagon ourselves <img src='http://gm22.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Like archive.org, at least the nla does it for an admirable reason <img src='http://gm22.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Consider it a history tax, Mr Ripper <img src='http://gm22.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>my 0.02 cents <img src='http://gm22.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Meg</title>
		<link>http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/comment-page-1/#comment-3</link>
		<dc:creator>Meg</dc:creator>
		<pubDate>Wed, 30 Aug 2006 11:15:09 +0000</pubDate>
		<guid isPermaLink="false">http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/#comment-3</guid>
		<description>It&#039;s really not any different to the way the Internet Archive crawls the net, and I don&#039;t think that privacy laws are really applicable to content that&#039;s freely available online anyway. Perhaps you&#039;re thinking of copyright laws?

Anyway, the process that&#039;s normally followed by Pandora is:

1. Site identified as &#039;significant&#039; (Guidelines from the State Library of WA for this are online here: 
http://www.slwa.wa.gov.au/pdf/pandoraoct02.pdf) 
by one of several archiving libraries in Australia. 
2. Site owner is contacted for permission to archive (I&#039;m not sure what happens if they are unable to find contact information on the owner). 
3. Site submitted to Pandora and indexed. 

This crawl (and the previous one in 2005) are different in that they are crawling all of the *.au namespace (and sites that are identified as having an Australian IP address). However, the content will not become publicly available from Pandora without the site owner&#039;s permission - they do state that on their information page.

If you&#039;re ever contacted by the NLA to have your site included in Pandora it&#039;s pretty cool to know that your site is considered &quot;of interest&quot; for posterity.</description>
		<content:encoded><![CDATA[<p>It&#8217;s really not any different to the way the Internet Archive crawls the net, and I don&#8217;t think that privacy laws are really applicable to content that&#8217;s freely available online anyway. Perhaps you&#8217;re thinking of copyright laws?</p>
<p>Anyway, the process that&#8217;s normally followed by Pandora is:</p>
<p>1. Site identified as &#8216;significant&#8217; (Guidelines from the State Library of WA for this are online here:<br />
<a href="http://www.slwa.wa.gov.au/pdf/pandoraoct02.pdf" rel="nofollow">http://www.slwa.wa.gov.au/pdf/pandoraoct02.pdf</a>)<br />
by one of several archiving libraries in Australia.<br />
2. Site owner is contacted for permission to archive (I&#8217;m not sure what happens if they are unable to find contact information on the owner).<br />
3. Site submitted to Pandora and indexed. </p>
<p>This crawl (and the previous one in 2005) are different in that they are crawling all of the *.au namespace (and sites that are identified as having an Australian IP address). However, the content will not become publicly available from Pandora without the site owner&#8217;s permission &#8211; they do state that on their information page.</p>
<p>If you&#8217;re ever contacted by the NLA to have your site included in Pandora it&#8217;s pretty cool to know that your site is considered &#8220;of interest&#8221; for posterity.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tuna</title>
		<link>http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/comment-page-1/#comment-2</link>
		<dc:creator>Tuna</dc:creator>
		<pubDate>Wed, 30 Aug 2006 05:38:58 +0000</pubDate>
		<guid isPermaLink="false">http://gm22.net/2006/08/30/web-site-harvesting-%e2%80%93-it%e2%80%99s-happening-now/#comment-2</guid>
		<description>Yes, but it&#039;s not all the web! See - 

http://pandora.nla.gov.au/selectionguidelinesallpartners.html

&quot;PANDORA is a selective archive. The National Library and its partners do not attempt to collect all Australian online publications and web sites, but select those that they consider are of significance and to have long-term research value.&quot;

You can also opt out. Some Govt Agencies opt out for legal reasons.</description>
		<content:encoded><![CDATA[<p>Yes, but it&#8217;s not all the web! See &#8211; </p>
<p><a href="http://pandora.nla.gov.au/selectionguidelinesallpartners.html" rel="nofollow">http://pandora.nla.gov.au/selectionguidelinesallpartners.html</a></p>
<p>&#8220;PANDORA is a selective archive. The National Library and its partners do not attempt to collect all Australian online publications and web sites, but select those that they consider are of significance and to have long-term research value.&#8221;</p>
<p>You can also opt out. Some Govt Agencies opt out for legal reasons.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
