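From a technical standpoint, fetching web pages is probably not very difficult. The hard part is analyzing and organizing those pages, which takes a program with some lightweight intelligence and a great deal of mathematical computation. A simple flow is outlined below.
Here we will only cover how to write a program that fetches web pages.
First, let's look at how to open a web page from the command line.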
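telnet somesite.com 80
The point of using telnet is to show that, underneath, this is simply socket programming that speaks the HTTP protocol, for example using the GET method to retrieve a page. After that, of course, you need to parse the HTML syntax, and possibly JavaScript as well: more and more pages use Ajax, and much page content is loaded through it, so simply parsing the HTML file will be far from enough in the future. Here, though, we only show a very simple crawl, simple enough that it can serve only as an example. The pseudocode for this example is given below.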
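The Ruby Crawler class in this post needs little further explanation; try it out for yourself. An example of how to use it comes right after the class definition. Beyond this example, there are also open-source projects related to web crawlers that are worth a look.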
from:
Once telnet has connected, type the following request and press Enter twice:
GET /index.html HTTP/1.0
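The telnet session above can be reproduced from a program. The following is only a minimal sketch using Ruby's standard `socket` library; the helper names `build_request`, `parse_response` and `fetch_raw` are made up for illustration, and the `Host` header is an addition that the telnet example does not send (many servers expect it).

```ruby
require "socket"

# Build the same request that was typed into telnet. The empty line at the
# end (the second "\r\n") is what pressing Enter twice produces.
# The Host header is an addition not present in the telnet example.
def build_request(path, host)
  "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"
end

# Split a raw HTTP response into [status_line, headers_hash, body].
def parse_response(raw)
  head, body = raw.split("\r\n\r\n", 2)
  lines = head.to_s.split("\r\n")
  status_line = lines.shift
  headers = lines.each_with_object({}) do |line, h|
    name, value = line.split(":", 2)
    h[name.strip.downcase] = value.to_s.strip
  end
  [status_line, headers, body.to_s]
end

# Open a TCP socket, send the request, and read until the server closes
# the connection (which an HTTP/1.0 server does after the response).
def fetch_raw(host, path = "/index.html", port = 80)
  TCPSocket.open(host, port) do |sock|
    sock.write(build_request(path, host))
    sock.read
  end
end

# Usage (needs network access; somesite.com is the placeholder host
# used in the article):
#   status, headers, body = parse_response(fetch_raw("somesite.com"))
#   puts status
```

Keeping the request builder and the response parser separate from the socket call makes both easy to check without touching the network.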
fetch the page
for each link in all links on the current page
{
    if (this link is one we want && this link has never been visited)
    {
        process this link
        mark this link as visited
    }
}
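The pseudocode can be tried out without any network access. The sketch below is only an illustration under stated assumptions: `sample_pages` is a made-up in-memory stand-in for a web site, `wanted?` assumes that "a link we want" means an `.html` page, and links are pulled out with a crude regular expression where a real crawler would use an HTML parser.

```ruby
require "set"

# Hypothetical stand-in for a web site: URL => HTML text.
def sample_pages
  {
    "/index.html" => '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html"     => '<a href="/index.html">home</a> <a href="/b.html">B</a>',
    "/b.html"     => '<a href="/c.zip">archive</a>'
  }
end

# Crude link extraction: collect the value of every href="..." attribute.
def extract_links(html)
  html.scan(/href="([^"]+)"/).flatten
end

# Assumed filter for "a link we want": only .html pages.
def wanted?(link)
  link.end_with?(".html")
end

# Crawl from start_url and return the pages in the order they were processed.
def crawl(start_url, pages)
  visited = Set.new([start_url])
  queue = [start_url]
  order = []
  until queue.empty?
    url = queue.shift
    order << url                          # "process this link"
    extract_links(pages.fetch(url, "")).each do |link|
      next unless wanted?(link) && !visited.include?(link)
      visited << link                     # "mark this link as visited"
      queue << link
    end
  end
  order
end

p crawl("/index.html", sample_pages)
# prints ["/index.html", "/a.html", "/b.html"]
```

The visited set is what keeps the loop from running forever: `/a.html` links back to `/index.html`, but that page is never queued a second time.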
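require "rubygems"
require "mechanize"

# Mechanize releases before 1.0 used the WWW::Mechanize namespace.
class Crawler < Mechanize
  attr_accessor :callback

  # Actions the callback may return for a link.
  INDEX = 0     # fetch the page and follow its links
  DOWNLOAD = 1  # save the target to a local file
  PASS = 2      # skip the link

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  def init
    @visited = []
  end

  # Mark a link as visited without fetching it.
  def remember(link)
    @visited << link
  end

  # Fetch a page and crawl every link on it that has not been visited yet.
  def perform_index(link)
    page = get(link)
    return unless page.is_a?(Mechanize::Page)
    links = page.links.map { |l| l.href } - @visited
    links.each { |alink| start(alink) }
  end

  def start(link)
    return if link.nil? || @visited.include?(link)
    action = @callback.call(link)
    if @first
      # Always index the entry page, then stop so it is not fetched twice.
      @first = false
      perform_index(link)
      return
    end
    case action
    when INDEX
      perform_index(link)
    when DOWNLOAD
      page = get(link)
      page.save_as(File.basename(link)) if page
    when PASS
      puts "passing on #{link}"
    end
  end

  # Wrap Mechanize#get: log the URL, record the visit, return nil on failure.
  def get(site)
    puts "getting #{site}"
    @visited << site
    super(site)
  rescue StandardError
    puts "error getting #{site}"
    nil
  end
end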
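# Assumes the class above is saved as crawler.rb next to this script.
require_relative "crawler"

x = Crawler.new

# Decide what to do with each link based on its extension.
callback = lambda do |link|
  if link =~ /\.(zip|rar|gz|pdf|doc)/
    x.remember(link)
    return Crawler::PASS
  elsif link =~ /\.(jpg|jpeg)/
    return Crawler::DOWNLOAD
  end
  return Crawler::INDEX
end

x.callback = callback
x.start("http://somesite.com")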
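The regular expressions in the callback match the extension anywhere in the link, so a URL such as `http://somesite.com/my.docs/page.html` would be passed over because it contains `.doc`. One possible refinement, offered only as a sketch, is to look at the extension of the URL path alone. The `classify` function and the symbols it returns are hypothetical stand-ins for the callback and for `Crawler::PASS`, `Crawler::DOWNLOAD` and `Crawler::INDEX`.

```ruby
require "uri"

# Decide what to do with a link from the file extension of its path.
# Returns :pass, :download or :index.
def classify(link)
  path = begin
    URI.parse(link).path.to_s      # drops the query string and host
  rescue URI::InvalidURIError
    link                           # fall back to the raw string
  end
  case File.extname(path).downcase
  when ".zip", ".rar", ".gz", ".pdf", ".doc" then :pass
  when ".jpg", ".jpeg" then :download
  else :index
  end
end

p classify("http://somesite.com/files/doc.zip")       # prints :pass
p classify("http://somesite.com/a/photo.JPG?size=1")  # prints :download
p classify("http://somesite.com/my.docs/page.html")   # prints :index
```

Inside the callback, the three symbols would map back to the `Crawler` constants.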
http://coolshell.cn/?p=27
]]>