Common Junk Spiders and How to Block Them


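A junk spider is a crawler that does nothing of substance for a site's brand or traffic while consuming its resources. Such spiders crawl site content frequently and use it for data analysis that serves their own commercial purposes.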

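Junk spiders:
SemrushBot: the crawler of Semrush, a search engine optimization (SEO) company, so its reason for crawling pages is obvious. It is of no use to the site. Fortunately it obeys the robots protocol, so it can be blocked directly in robots.txt.
DotBot: Moz's crawler, used to provide SEO services, but of no use to us. It obeys the robots protocol, so it can be blocked in robots.txt.
AhrefsBot: Ahrefs's crawler, used to provide SEO services and of no use to us. It obeys the robots protocol.
MJ12bot: the crawler of a UK search engine, of no use to Chinese-language sites. It obeys the robots protocol.
MauiBot: it is unclear what this one is for, but it sometimes crawls very aggressively. Fortunately it obeys the robots protocol.
MegaIndex.ru: the crawler of a site that offers backlink lookups, so it crawls mainly to analyze links and brings us no benefit. It obeys the robots protocol.
BLEXBot: WebMeUp's crawler, which collects the links found on a site and is of no use to us. It obeys the robots protocol.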

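Other spiders:
FeedDemon             content scraping
BOT/0.1 (BOT for JCE) SQL injection
CrawlDaddy            SQL injection
Java                  content scraping
Jullo                 content scraping
Feedly                content scraping
UniversalFeedParser   content scraping
ApacheBench           CC attack tool
Swiftbot              useless crawler
YandexBot             useless crawler (Russian search engine)
AhrefsBot             useless crawler
jikeSpider            useless crawler
MJ12bot               useless crawler
ZmEu phpmyadmin       vulnerability scanning
WinHttp               scraping, CC attacks
EasouSpider           useless crawler
HttpClient            TCP attacks
Microsoft URL Control scanning
YYSpider              useless crawler
jaunty                WordPress brute-force scanner
oBot                  useless crawler
Python-urllib         content scraping
Python-requests       content scraping
Indy Library          scanning
FlightDeckReports Bot useless crawler
Linguee Bot           useless crawler
Leikibot              useless crawler
CriteoBot             useless crawler
AmazonAdBot           useless crawler
serpstatbot           useless crawler
bidswitchbot          useless crawler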

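Blocking methods:
1. Block via robots.txt
Spiders that obey the robots protocol can be disallowed directly in robots.txt. To block the common useless spiders listed above, add the following to the robots.txt file in the site's root directory.
Contents of robots.txt: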

User-agent: SemrushBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: MauiBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: BLEXBot
Disallow: /
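
Before deploying rules like these, it can help to check that they parse the way you expect. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed` and the shortened rule set are illustrative assumptions, not part of the original setup. Note that this parser matches on the bot's name token (for example `SemrushBot`), not on a full User-Agent header.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above, used only for this check.
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
"""


def is_allowed(robots_txt: str, bot_name: str, path: str = "/") -> bool:
    """Return True if robots_txt lets the named bot fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_name, path)


if __name__ == "__main__":
    for bot in ("SemrushBot", "AhrefsBot", "Googlebot"):
        print(bot, is_allowed(ROBOTS_TXT, bot, "/index.html"))
```

A bot with no matching `User-agent` group and no `*` group is treated as allowed, which is why crawlers you do want are unaffected by these rules.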

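2. Block via nginx configuration
Create an agent_deny.conf file in the /etc/nginx directory with the following contents:
(The same contents can also be added directly to /etc/nginx/nginx.conf or to the site's nginx server block; in that case no include is needed.)

Configuration for the server block:
# Block Scrapy and similar HTTP client tools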
if ($http_user_agent ~* (Scrapy|Curl|HttpClient|Go-http-client))
{
     return 403;
}

# Block the listed UAs and requests with an empty UA
# Full list: FeedDemon|CCBot|JikeSpider|python-requests|SemrushBot|GrapeshotCrawler|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|DomainCrawler|HeadlessChrome|opensiteexplorer|BLEXBot|Bytespider|VelenPublicWebCrawler
if ($http_user_agent ~ "FeedDemon|CCBot|^$")
{
     return 403;
}

# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$)
{
    return 403;
}

# Hide the nginx version in responses
server_tokens off;
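
To dry-run these rules against sample requests before reloading nginx, the three `if` conditions can be mirrored in a few lines of Python. This is only a sketch: the function name `nginx_would_return_403` is made up for illustration, and it assumes Python's `re` behaves like nginx's PCRE for these simple patterns.

```python
import re

# Mirrors `~*` (case-insensitive match) in the first rule.
TOOL_UA = re.compile(r"(Scrapy|Curl|HttpClient|Go-http-client)", re.IGNORECASE)
# Mirrors `~` (case-sensitive match); `^$` matches an empty User-Agent.
LISTED_UA = re.compile(r"FeedDemon|CCBot|^$")
# Mirrors the `!~ ^(GET|HEAD|POST)$` check on the request method.
ALLOWED_METHOD = re.compile(r"^(GET|HEAD|POST)$")


def nginx_would_return_403(user_agent: str, method: str = "GET") -> bool:
    """Approximate whether the rules above would reject this request."""
    if TOOL_UA.search(user_agent):
        return True
    if LISTED_UA.search(user_agent):
        return True
    return ALLOWED_METHOD.search(method) is None


if __name__ == "__main__":
    samples = [("curl/8.0.1", "GET"), ("", "GET"),
               ("Mozilla/5.0 (X11; Linux x86_64)", "GET"),
               ("Mozilla/5.0 (X11; Linux x86_64)", "DELETE")]
    for ua, method in samples:
        print(repr(ua), method, nginx_would_return_403(ua, method))
```

One thing the sketch makes visible: the second rule uses `~`, so it is case-sensitive, while the first uses `~*` and also catches lowercase names such as `curl`.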

Include agent_deny.conf in nginx.conf:

try_files $uri @proxy;
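# A regex location needs the ~ modifier; without it the path is a literal prefix
location ~ ^/(assets|images)/ {
    include agent_deny.conf;
    expires max;
}
# Escape the dot so that only real .ico and .txt extensions match
location ~ .*\.(ico|txt)$ {
    include agent_deny.conf;
    expires max;
}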
location @proxy {
    include agent_deny.conf;
    proxy_pass_header Server;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    ...
}

3. Spiders that ignore robots rules can be blocked in the server-side application by User-Agent or IP address.
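
As one possible shape for such server-side blocking, here is a minimal WSGI middleware sketch in Python. Everything in it is an assumption for illustration: the class name `BlockSpiders`, the short UA pattern, and the IP range `203.0.113.0/24` (a reserved documentation range used as a placeholder) would all need to be replaced with values from your own logs.

```python
import ipaddress
import re

# Hypothetical denylists; adjust to the spiders seen in your own access logs.
BLOCKED_UA = re.compile(r"SemrushBot|AhrefsBot|MJ12bot|python-requests", re.IGNORECASE)
BLOCKED_NETS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder range


def is_blocked(user_agent: str, remote_addr: str) -> bool:
    """Return True if the User-Agent or the client IP is on a denylist."""
    if BLOCKED_UA.search(user_agent):
        return True
    try:
        ip = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # unparsable address: fall through to the application
    return any(ip in net for net in BLOCKED_NETS)


class BlockSpiders:
    """WSGI middleware that answers 403 for denylisted clients."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        remote_addr = environ.get("REMOTE_ADDR", "")
        if is_blocked(user_agent, remote_addr):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain; charset=utf-8")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is then a single line, for example `application = BlockSpiders(application)`. If the application sits behind a reverse proxy such as the nginx setup above, `REMOTE_ADDR` will be the proxy's address, so the client IP would have to be taken from a trusted forwarded header instead.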
