nginx performance tuning: blocking Scrapy and other junk spiders
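Spiders and bots from overseas crawl our site constantly, yet being indexed by most overseas search engines does little for us, so we can choose to block them.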
Create the agent_deny.conf file: go to the /www/server/nginx/conf directory and create a new file named agent_deny.conf.
agent_deny.conf
# Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot|^$") {
    return 403;
}
# Block request methods other than GET, HEAD and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
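Before reloading nginx, it can help to see which requests these three rules would reject. The following is a minimal sketch in Python, not part of the nginx setup itself. It assumes Python's re module behaves like nginx's PCRE matching for these simple patterns; the helper name status_for and the sample User-Agent strings are made up for illustration.

```python
import re

# Patterns copied from agent_deny.conf. re.IGNORECASE stands in for nginx's "~*".
TOOL_UA = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)

LISTED = [
    "FeedDemon", "Indy Library", "Alexa Toolbar", "AskTbFXTV", "AhrefsBot",
    "CrawlDaddy", "CoolpadWebkit", "Java", "Feedly", "UniversalFeedParser",
    "ApacheBench", "Microsoft URL Control", "Swiftbot", "ZmEu", "oBot",
    "jaunty", "Python-urllib", "lightDeckReports Bot", "YYSpider", "DigExt",
    "HttpClient", "MJ12bot", "heritrix", "Bytespider", "Ezooms", "Googlebot",
    "JikeSpider", "SemrushBot", "^$",  # "^$" matches an empty User-Agent
]
LISTED_UA = re.compile("|".join(LISTED), re.IGNORECASE)

# nginx's "!~" is a case-sensitive negated match, so no flag here.
ALLOWED_METHOD = re.compile(r"^(GET|HEAD|POST)$")


def status_for(method: str, user_agent: str) -> int:
    """Hypothetical helper: 403 if any of the three rules would reject, else 200."""
    if TOOL_UA.search(user_agent):
        return 403
    if LISTED_UA.search(user_agent):
        return 403
    if not ALLOWED_METHOD.search(method):
        return 403
    return 200


# Illustrative requests (the UA strings are examples, not taken from real logs).
print(status_for("GET", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # 200
print(status_for("GET", "Scrapy/2.11.0 (+https://scrapy.org)"))        # 403
print(status_for("GET", ""))                                           # 403
print(status_for("DELETE", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # 403
```

Because the matching is case-insensitive and unanchored, any User-Agent that merely contains one of the listed words is rejected, so it is worth running your own legitimate clients through a check like this first.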
Open the site settings, go to the configuration file, and add include agent_deny.conf; after the root... line:
include agent_deny.conf;
Restart nginx.
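Note: many people, like me, may publish content through BaoTa (宝塔) scheduled tasks, and the rules above reject the URL request that a BaoTa scheduled task triggers. Why? First, let's look at the access log entry left by such a request: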
124.220.170.95 - - [14/Oct/2022:23:07:57 +0800] "GET /api/publishcontent HTTP/2.0" 200 78 "-" "curl/7.61.1"
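If you would rather not pick the User-Agent out of log lines by eye, the field can be extracted with a short script. This is only a sketch: it assumes the access log uses nginx's default "combined" format, where the User-Agent is the last double-quoted field, and the names LOG_LINE and parse_line are hypothetical.

```python
import re

# Assumes nginx's default "combined" log format:
# ip - user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> dict:
    """Hypothetical helper: split one access log line into named fields."""
    match = LOG_LINE.match(line)
    if match is None:
        raise ValueError("line does not look like the combined log format")
    return match.groupdict()


# The log entry shown above.
sample = (
    '124.220.170.95 - - [14/Oct/2022:23:07:57 +0800] '
    '"GET /api/publishcontent HTTP/2.0" 200 78 "-" "curl/7.61.1"'
)
fields = parse_line(sample)
print(fields["request"])     # GET /api/publishcontent HTTP/2.0
print(fields["user_agent"])  # curl/7.61.1
```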
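The log shows that the BaoTa request carries the User-Agent "curl/7.61.1". Because ~* matches case-insensitively, the Curl entry in the first rule matches it and the request is rejected with a 403. So we need to either delete Curl from the Scrapy|Curl|HttpClient rule or comment out that whole rule: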
# Block scraping tools such as Scrapy
#if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
#    return 403;
#}
# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot|^$") {
    return 403;
}
# Block request methods other than GET, HEAD and POST
#if ($request_method !~ ^(GET|HEAD|POST)$) {
#    return 403;
#}
Or:
# Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|HttpClient)) {
    return 403;
}
# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot|^$") {
    return 403;
}
# Block request methods other than GET, HEAD and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
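To confirm that dropping Curl is enough to let the scheduled task through, the original and trimmed first rules can be compared side by side. Again this is a rough Python sketch that assumes the re module matches these patterns the same way nginx does; is_blocked is a hypothetical helper, and the Scrapy and HttpClient User-Agent strings are illustrative examples.

```python
import re

# First rule before and after removing Curl. re.IGNORECASE mirrors nginx's "~*".
ORIGINAL = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)
TRIMMED = re.compile(r"(Scrapy|HttpClient)", re.IGNORECASE)


def is_blocked(pattern: re.Pattern, user_agent: str) -> bool:
    """Hypothetical helper: True if the rule would return 403 for this UA."""
    return pattern.search(user_agent) is not None


samples = [
    "curl/7.61.1",                          # the BaoTa scheduled task from the log
    "Scrapy/2.11.0 (+https://scrapy.org)",  # illustrative Scrapy UA
    "Apache-HttpClient/4.5.13",             # illustrative HttpClient UA
]
for ua in samples:
    print(f"{ua!r}: original={is_blocked(ORIGINAL, ua)}, trimmed={is_blocked(TRIMMED, ua)}")
```

Keep in mind that the trimmed rule lets every curl client through, not only the BaoTa scheduled task.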