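Last updated: 2007-11-16 Keywords: Heritrix
I am crawling with a single machine, so I want to hash Heritrix's links across multiple threads. However, after I wrote my hashing ELFHashQueueAssignmentPolicy, the first run resolves only 30 `dns:` URIs and then the job ends on its own. On the second or third run, though, the crawl does use multiple threads.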
I have also already changed both the Heritrix.properties file and the corresponding place in AbstractFrontier. I hope you can take a look. Thanks.

```java
/*******************************************************************************
 * File description:
 *
 * Project: WebCrawler
 * File:    ELFHashAssignmentPolicy.java
 * Package: com.hotct.heritrixExt.common.frontier
 *
 * Author:       zhangzhenxin
 * Time created: 03:50:01 PM
 * Date created: 2007-10-30
 ******************************************************************************/
package com.hotct.heritrixExt.common.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.frontier.HostnameQueueAssignmentPolicy;
import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/**
 * <h>Type description</h>
 *
 * @author zhangzhenxin
 * @date 2007-10-30
 */
public class ELFHashAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
            .getLogger(ELFHashAssignmentPolicy.class.getName());

    private static String DEFAULT_CLASS_KEY = "default...";

    private static final String DNS = "dns";

    /**
     * Returns the queue key for a URI: dns: URIs follow the URI that
     * triggered them; everything else goes to one of 100 hash buckets.
     */
    @Override
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;
        try {
            if (scheme.equals(DNS)) {
                if (cauri.getVia() != null) {
                    // Special handling for DNS: treat as being
                    // of the same class as the triggering URI.
                    // When a URI includes a port, this ensures
                    // the DNS lookup goes atop the host:port
                    // queue that triggered it, rather than
                    // some other host queue
                    UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                    // adopt scheme of triggering URI
                    scheme = viaUuri.getScheme();
                } else {
                    candidate = cauri.getUURI().getReferencedHost();
                }
            } else {
                // Non-DNS URIs: bucket by the ELF hash of the full URI string.
                long hash = ELFHash(uri);
                candidate = Long.toString(hash % 100);
            }
            if (candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                    "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }
        return candidate.replace(':', '#');
    }

    /** ELF-style string hash computed in a long; the result is non-negative. */
    public static long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
```

Notice: JavaEye articles are the copyright of their authors and are protected by law. They may not be reproduced without the author's written permission.
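For anyone who wants to check the bucketing logic outside of Heritrix, here is a minimal standalone sketch of the non-DNS branch of the policy above: the same ELF-style hash, taken modulo 100, used as the queue key. The class name `ElfHashBuckets`, the helper `bucketKey`, and the sample URIs are hypothetical and only for illustration. It does not touch any Heritrix API and does not reproduce the first-run behaviour described in the post.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; mirrors only the hashing part of the posted policy.
class ElfHashBuckets {
    // Same bucket count as the "hash % 100" in the posted code.
    static final int BUCKETS = 100;

    // ELF-style hash as written in the post (computed in a long).
    static long elfHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash << 4) + s.charAt(i);
            long high = hash & 0xF0000000L;
            if (high != 0) {
                hash ^= (high >> 24);
                hash &= ~high;
            }
        }
        // Masking keeps the result non-negative, so the modulo below is too.
        return hash & 0x7FFFFFFFL;
    }

    // Queue key for a URI string: a bucket number from "0" to "99".
    static String bucketKey(String uri) {
        return Long.toString(elfHash(uri) % BUCKETS);
    }

    public static void main(String[] args) {
        // Made-up sample URIs, several of them on the same host.
        String[] uris = {
            "http://example.com/",
            "http://example.com/a.html",
            "http://example.com/b.html",
            "http://example.org/index.html",
        };
        // Count how many URIs land in each bucket.
        Map<String, Integer> counts = new TreeMap<>();
        for (String uri : uris) {
            counts.merge(bucketKey(uri), 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```

Because the key is computed from the whole URI string rather than from the host, URIs on the same host can end up in different queues. That is what spreads the work over more threads, but it also means those queues no longer correspond to one host each.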
Last updated: 2008-04-06
I have run into the same problem. Did the OP ever manage to solve it?


