One of the sites at FAST.hit got a huge amount of traffic last night. Probably more than this site would normally get in 3 months. After some guesswork and a look through the logs, we noticed a lot of references to the user agent http://pandora.nla.gov.au/crawl.html. We followed the link and, funnily enough, found a “Notice to Webmasters”. This notice basically says: “In August and September 2006 the National Library of Australia is undertaking a comprehensive crawl and harvest of the Australian web domain using the services of the Internet Archive…”
They do not just crawl your site and index your pages, but actually copy everything for “archiving purposes”.
Hello…???
- Has anyone heard about privacy laws?
- Who is going to pay for all that traffic?
- Will the site owners ever be notified about this?
- How do they determine which site is Australian and which is not (based on IP, DNS, or domain)?
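For anyone who wants to run the same check on their own logs, here is a rough sketch of how the hits and bytes per user agent could be tallied. It assumes the server writes the common "combined" access log format; the function names, the sample lines, and the exact user-agent string in them are made up for illustration and are not taken from our real logs.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_agents(lines):
    """Return two Counters keyed by user agent: request count and bytes sent."""
    hits = Counter()
    sent = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not follow the assumed format
        agent = m.group("agent")
        size = m.group("bytes")
        hits[agent] += 1
        sent[agent] += 0 if size == "-" else int(size)
    return hits, sent


def crawler_totals(lines, marker):
    """Total (requests, bytes) for every user agent containing `marker`."""
    hits, sent = summarize_agents(lines)
    matching = [agent for agent in hits if marker in agent]
    return (sum(hits[a] for a in matching), sum(sent[a] for a in matching))


# Hypothetical sample lines; the addresses and agent strings are invented.
SAMPLE = [
    '192.0.2.10 - - [29/Aug/2006:23:10:01 +1000] "GET / HTTP/1.1" 200 5120 '
    '"-" "example-crawler (+http://pandora.nla.gov.au/crawl.html)"',
    '192.0.2.10 - - [29/Aug/2006:23:10:02 +1000] "GET /about.html HTTP/1.1" 200 2048 '
    '"-" "example-crawler (+http://pandora.nla.gov.au/crawl.html)"',
    '192.0.2.55 - - [29/Aug/2006:23:11:40 +1000] "GET / HTTP/1.1" 200 1000 '
    '"-" "Mozilla/5.0 (ordinary browser)"',
]

if __name__ == "__main__":
    requests, total_bytes = crawler_totals(SAMPLE, "pandora.nla.gov.au")
    print(f"{requests} requests, {total_bytes} bytes")
```

To use it on a real log, pass an open file instead of `SAMPLE`, for example `crawler_totals(open("access.log"), "pandora.nla.gov.au")`, where `access.log` stands in for wherever your server keeps its log.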
August 30th, 2006 at 1:38 pm
Yes, but it’s not all the web! See:
http://pandora.nla.gov.au/selectionguidelinesallpartners.html
“PANDORA is a selective archive. The National Library and its partners do not attempt to collect all Australian online publications and web sites, but select those that they consider are of significance and to have long-term research value.”
You can also opt out. Some Govt Agencies opt out for legal reasons.
August 30th, 2006 at 7:15 pm
It’s really not any different to the way the Internet Archive crawls the net, and I don’t think that privacy laws are really applicable to content that’s freely available online anyway. Perhaps you’re thinking of copyright laws?
Anyway, the process that’s normally followed by Pandora is:
1. Site identified as ‘significant’ by one of several archiving libraries in Australia (guidelines from the State Library of WA for this are online here: http://www.slwa.wa.gov.au/pdf/pandoraoct02.pdf).
2. Site owner is contacted for permission to archive (I’m not sure what happens if they are unable to find contact information on the owner).
3. Site submitted to Pandora and indexed.
This crawl (like the previous one in 2005) is different in that they are crawling all of the *.au namespace (and sites that are identified as having an Australian IP address). However, the content will not become publicly available from Pandora without the site owner’s permission - they do state that on their information page.
If you’re ever contacted by the NLA to have your site included in Pandora, it’s pretty cool to know that your site is considered “of interest” for posterity.
September 7th, 2006 at 11:52 am
Hi G,
This is nothing new.
What about all those search engines indexing the web and tying up bandwidth?
How many of those #$#% things does the world need?
One educational facility I know of is using the Google Search. It indexes about 800,000 pages every time. Imagine the cost of that bandwidth. Not a bad argument for a Google Appliance.
And what of the old Time Machine (www.archive.org)? They’ve been doing it for years.
Advertisements? Big bandwidth hogs… We actually pay to receive advertising. What a bunch of suckers we are!!! I’d love to see ISPs block ads altogether to preserve our precious bandwidth.
I don’t block Flash for nothing you know…
Really needs the whole population to wise up…
OR get on the wagon ourselves.
Like archive.org, at least the NLA does it for an admirable reason.
Consider it a history tax, Mr Ripper
my 0.02 cents