vb.org Archive (https://vborg.vbsupport.ru/index.php)
-   vBulletin 4.x Add-ons (https://vborg.vbsupport.ru/forumdisplay.php?f=245)
-   -   Miscellaneous Hacks - Ban Spiders by User Agent (https://vborg.vbsupport.ru/showthread.php?t=268208)

Black Snow 12-10-2014 10:29 AM

Quote:

Originally Posted by Max Taxable (Post 2526554)
No sir we have been trying to solve the mystery of why Baidu gets through on some v4 installations, but not all and never a v3, and my hook conflict idea opened a new can of worms for investigation, and Ozz found something very interesting.

I never looked before but it is also getting through on my site.

CAG CheechDogg 12-10-2014 10:54 AM

Baidu has kissed my "gluteus maximus" for almost 2 years and change if not more ... So has Yandex and a handful of others as well ... I must have a "magic" forum :)

ozzy47 12-10-2014 11:04 AM

I am reworking a couple of things and then need to test further; I can then share my findings with Simon. :)

Gadget_Guy 12-10-2014 01:14 PM

I would be happy to test on my site if it helps the community.

D.

Max Taxable 12-10-2014 03:55 PM

Quote:

Originally Posted by ForceHSS (Post 2526565)
What interesting thing was found?

I don't wanna flap it, already flapped too much. But I am pretty sure the problem is solved.

Gadget_Guy 12-11-2014 02:27 AM

If this helps anyone.... this is a list of what I am seeing in terms of spiders on my site with this installed.

Bing Spiders (6),
Google Favicon Spiders (9),
Proximic Spiders (135),
Baidu Spiders (175),
WinHTTP Spiders (12),
Facebook Spiders (20),
Google AdSense Spiders (7),
Magpie Spiders (9),
linkdexbot/2.0 Spiders (7),
AhrefsBot Spiders (14),
Coccoc Spiders (2),
Google AppEngine Spiders (6),
Google Spiders (40),
Sucuri Spiders (3),
Twitterbot Spiders (4),
Google FeedFetcher Spiders (3),
Apple RSS Spiders (1),
WordPress.com mShots Spiders (1),
Google Web Preview Spiders (3),
Grapeshot Spiders (2),
James BOT WebCrawler Spiders (5),
Netseer crawler/2.0 Spiders (2),
Google Images Spiders (3),
Galaxy Spiders (2),
Feedly Spiders (2),
DotBot Spiders (1),
Yahoo! Slurp Spiders (1),
360Spider Spiders (4),
Netcraft Web Server Survey Spiders (1),
NerdyBot Spiders (2),
Exabot Spiders (1),
Integrity Bot Spiders (1),
ContextAd Bot Spiders (2),
Twitturls.com (Python-urllib) Spiders (1)

I am happy to supply any information that you may find useful to assist in the work you are doing.

D.

Simon Lloyd 12-11-2014 05:34 AM

I need a snapshot of your settings for the mod, as there is no way all of those would get past the mod if they were entered in it!

CAG CheechDogg 12-11-2014 08:09 AM

1 Attachment(s)
This is a snapshot of the spiders that are showing up in Who's Online:

https://vborg.vbsupport.ru/external/2014/12/30.jpg

What exactly do you need a snapshot of in the settings, Simon?

This is my list of spiders I have banned with your mod:

almaden
Anarchie
Artabus
ASPSeek
attach
autoemailspider
BackWeb
Baidu
Bandit
BatchFTP
BlackWidow
Bot\mailto:craftbot@yahoo.com
Buddy
bumblebee
CherryPicker
ChinaClaw
CICC
Collector
Copier
Copyscape
Crescent
DIIbot
DISCo
DISCo\Pump
dotbot
Download\Demon
Download\Wonder
Downloader
Drip
DSurf15a
eCatch
EasyDL/2.99
EirGrabber
email
EmailCollector
EmailSiphon
EmailWolf
Express\WebPictures
ExtractorPro
EyeNetIE
FileHound
FlashGet
FrontPage
GetRight
GetSmart
GetWeb!
gigabaz
GNIP
Go\!Zilla
Go!Zilla
Go-Ahead-Got-It
gotit
Grabber
GrabNet
Grafula
grub-client
HMView
HTTrack
httpdown
.*httrack.*
ia_archiver
Ichiro
Image\Stripper
Image\Sucker
Indy*Library
Indy\Library
InterGET
InternetLinkagent
Internet\Ninja
InternetSeer.com
Iria
JBH*agent
JetCar
JOC\Web\Spider
JustView
larbin
LeechFTP
LexiBot
lftp
Link*Sleuth
likse
//Link
LinkWalker
Mag-Net
Magnet
Magpie
magpie
Mass\Downloader
Memo
Microsoft.URL
MIDown\tool
Mirror
Mister\PiX
Mozilla.*Indy
Mozilla.*NEWT
Mozilla*MSIECrawler
MS\FrontPage*
MSFrontPage
MSIECrawler
MSProxy
Navroad
NearSite
NetAnts
NetMechanic
NetSpider
Net\Vampire
NetZIP
NICErsPRO
Ninja
Nutch
Octopus
Offline\Explorer
Offline\Navigator
omgili
Openfind
PageGrabber
Papa\Foto
PaperLiBot
pavuk
pcBrowser
Ping
PingALink
Pockey
psbot
Pump
QRVA
RealDownload
Reaper
Recorder
ReGet
Scooter
Seeker
Siphon
sitecheck.internetseer.com
SiteSnagger
SlySearch
SmartDownload
Snake
sogou
Soso
SpaceBison
speedy
Spinn3r
sproose
Stripper
Sucker
SuperBot
SuperHTTP
Surfbot
Szukacz
tAkeOut
Teleport\Pro
URLSpiderPro
Vacuum
VoidEYE
Web\Image\Collector
Web\Sucker
WebAuto
[Ww]eb[Bb]andit
webcollage
WebCopier
Web\Downloader
WebEMailExtrac.*
WebFetch
WebGo\IS
WebHook
WebLeacher
WebMiner
WebMirror
WebReaper
WebSauger
Website
Website\eXtractor
Website\Quester
Webster
WebStripper
WebWhacker
WebZIP
Wget
Whacker
Widow
WWWOFFLE
x-Tractor
Xaldon\WebSpider
Xenu
Yandex
Yeti
YOUDAOBOT
Zeus.*Webster
Zeus
baiduspider
beta.statsit.com
statsit
SiteIntel
Yandex
GomezAgent
FunWebProducts
Nesotebot
DCPbot
AOL Advertising R&D
DataCha0s
aiHitBot
Apache-HttpClient
Zend_Http_Client
ReverseGet
XXX bot Content
vBSEO
spbot
OffByOne
thyroidbuzz
AcoonBot
coccoc
xpymep
proxyproxy2884
AppEngine
start.exe
Semiocast HTTP client
Firefox/3.6.23
TurnitinBot
curl
SwpLc/1.6
GrepNetstat.com
news bot
AskTbPTV
checks
panopta
App3le
PhantomJS
AlwaysOnline
SISTRIX
proximic
CRAWL-E/0.6.4
WebMoney
Maxthon
HTMLParser
oBot
UnisterBot
ERACrawler
Butterfly
Topsy
Butterfly Topsy Crawler
Ezooms
Deepnet
Alexa
Bitlybot
Seznam
Fulltext
Facebook
Sunrise Communications AG
crawl
Crawl
MJ12bot
Bimbot
Snapbot
thunderstone
Thunderstone
grub-client
Bing
MSN
OOZBOT
Wayback Machine
Crowsnest Spider
FlipboardProxy
Feedly

Gadget_Guy 12-11-2014 10:48 AM

1 Attachment(s)
Here is my stuff:

Simon Lloyd 12-11-2014 05:06 PM

Hi Gadget Guy, remove the second picture as it has your email address in it. I see the settings are OK; now can you just copy the list as you have it (copy it straight out of the textbox in the mod), stick it in a WordPad document, zip it and attach it here so I can check it, please?

Gadget_Guy 12-11-2014 05:12 PM

That is what the spiders.txt file I attached is.

D.

Simon Lloyd 12-11-2014 05:17 PM

Quote:

Originally Posted by CAG CheechDogg (Post 2526712)
This is a snapshot of the spiders that are showing up in Who's Online:

https://vborg.vbsupport.ru/external/2014/12/30.jpg

What exactly do you need a snapshot of in the settings, Simon?

This is my list of spiders I have banned with your mod:

..........................................

There are one or two duplicates there, but that doesn't matter. However, rather than ban baiduspider, just ban baidu; you'll have more luck with that, as not all Baidu spiders have the entire name in the UA. That goes for most of the bots. Let's say there's a bot called Simon Lloyd Crawl Everything Everywherespider; then any of the following will ban it:
Simon or Lloyd or Crawl or Spider.....etc
The same goes for:
Lloyd Crawl or Everything or Simon Lloyd....etc (case isn't important)

What the mod does is look for the string you entered, so if you want to ban the spider I mentioned above, just Simon will do it. However, let's say you have a friendly bot called Simon Lloyd Crawled Everything Everywherespider; then to ban the first bot and allow the other you'd need to enter a string that is unique to the first one, so in this case it could be:
Simon Lloyd Crawl Everything
This way it won't pick up the "Crawl" in the friendly bot's name, as it's looking for the exact string you entered.

Hope that helps.
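
The matching rule described above can be sketched in a few lines. The mod itself is a vBulletin (PHP) product and its source isn't shown in this thread, so this is only a Python illustration of the described behaviour (case-insensitive "does the entry appear anywhere in the UA"); the function name is_banned is made up for the example.

```python
def is_banned(user_agent: str, ban_list: list[str]) -> bool:
    """Return True if any ban entry occurs anywhere in the User-Agent.

    Assumption: plain, case-insensitive substring matching, as described
    in the post above. The real mod may differ in detail.
    """
    ua = user_agent.lower()
    # Blank entries are skipped so an empty line can never match everything.
    return any(entry.lower() in ua for entry in ban_list if entry.strip())


bad_bot = "Simon Lloyd Crawl Everything Everywherespider"
friendly_bot = "Simon Lloyd Crawled Everything Everywherespider"

# A short entry is not unique: it catches both bots.
print(is_banned(bad_bot, ["simon"]))       # True
print(is_banned(friendly_bot, ["simon"]))  # True

# A longer entry that is unique to the first bot catches only that one.
unique_entry = ["Simon Lloyd Crawl Everything"]
print(is_banned(bad_bot, unique_entry))       # True
print(is_banned(friendly_bot, unique_entry))  # False
```

The last line is False because the friendly bot's UA reads "Crawled Everything", so the exact string "Crawl Everything" never appears in it.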

Simon Lloyd 12-11-2014 05:22 PM

Quote:

Originally Posted by Gadget_Guy (Post 2526788)
That is what the spiders.txt file I attached is.

D.

I need it as asked for; the reason is to check for machine characters and/or leading/trailing spaces, hard returns.....etc
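
For anyone who wants to run that kind of check on their own list before pasting it in, here is a rough Python sketch. It assumes one entry per line (which is how the lists in this thread are posted); the function name clean_entries and the sample text are invented for the example, and this is not how the mod itself processes the textbox.

```python
import unicodedata


def clean_entries(raw_text: str) -> list[str]:
    """Return ban-list entries with invisible characters and stray spaces removed."""
    cleaned = []
    # splitlines() copes with Windows (\r\n), old Mac (\r) and Unix (\n) line ends.
    for line in raw_text.splitlines():
        # Drop control/format characters (Unicode categories starting with "C"),
        # e.g. a byte-order mark, zero-width spaces or tabs.
        visible = "".join(ch for ch in line if unicodedata.category(ch)[0] != "C")
        entry = visible.strip()  # remove leading/trailing spaces
        if entry:                # skip lines that end up empty
            cleaned.append(entry)
    return cleaned


# Hypothetical messy input: BOM, trailing space, blank line, zero-width space, tab.
raw = "\ufeffBaidu \r\n  Yandex\r\n\r\nMag\u200bpie\t\r\n"
print(clean_entries(raw))  # ['Baidu', 'Yandex', 'Magpie']
```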

Gadget_Guy 12-11-2014 05:25 PM

Quote:

Originally Posted by Simon Lloyd (Post 2526789)
What the mod does is look for the string you entered, so if you want to ban the spider I mentioned above, just Simon will do it. However, let's say you have a friendly bot called Simon Lloyd Crawled Everything Everywherespider; then to ban the first bot and allow the other you'd need to enter a string that is unique to the first one, so in this case it could be:
Simon Lloyd Crawl Everything
This way it won't pick up the "Crawl" in the friendly bot's name, as it's looking for the exact string you entered.

Hope that helps.


This COULD be the issue I have with mine then. When you look at my txt file you will see that I tried putting in multiple variations. That could be negating the effectiveness.

Maybe the mod could include an updated list that we can copy/paste, so that people like me who are clueless don't do the wrong thing.

Keeping in mind that we would want the "good" spiders to get through like google, bing, and the legit ones that are important to SEO, Adsense, and other things like that.

I hope you and Ozzy are discussing the hook thing as well... he seemed to think that may be important with my 4.2.2 site.

I will say that when I had this mod in place for my 3.8.x site it worked perfectly and I didn't get hit hard till I upgraded to 4.2.2

I saw my server loads go way up....

D.

Simon Lloyd 12-11-2014 05:32 PM

There is already a list included, and many more throughout this thread; I've also explained the above before. I wouldn't update the list of spiders to ban because, as I've said probably over a dozen times, what or who you ban is a personal thing.

If you want to PM me access, as I've said before, I'll take a look.

Gadget_Guy 12-11-2014 05:40 PM

1 Attachment(s)
Here you go.

Simon Lloyd 12-11-2014 06:11 PM

1 Attachment(s)
Right, I've been through your list. I won't comment on the bots you are banning, as that's your preference. What I have done is order the list, check for anything that shouldn't be there, and remove some bots, as they will be taken care of by other entries you have.

What I will say is: if you are NOT using Paul M's "Who has visited" mod and you are still seeing any of the bots on your list appear in WOL, then you need to check that spider's UserAgent to see if the name or text you have in your list actually appears in the UA.
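
The tidy-up described above (ordering the list and dropping entries that a shorter entry already covers) can be roughed out as below. This is a Python sketch, not what was actually run on the attached list. It assumes entries are plain substrings matched without regard to case, so it does not try to interpret pattern-style entries such as .*httrack.*; the function name prune_ban_list is made up for the example.

```python
def prune_ban_list(entries: list[str]) -> list[str]:
    """Remove duplicates and entries already covered by a shorter entry, then sort."""
    # Deduplicate case-insensitively, keeping the first spelling seen.
    unique: dict[str, str] = {}
    for entry in entries:
        key = entry.strip().lower()
        if key and key not in unique:
            unique[key] = entry.strip()

    keys = list(unique)
    # An entry is redundant if some *other* entry is a substring of it:
    # anything "baiduspider" would match, "baidu" matches too.
    kept = [
        unique[key]
        for key in keys
        if not any(other != key and other in key for other in keys)
    ]
    return sorted(kept, key=str.lower)


sample = ["baiduspider", "Baidu", "Magpie", "magpie",
          "Yandex", "Yandex", "WebReaper", "Reaper"]
print(prune_ban_list(sample))  # ['Baidu', 'Magpie', 'Reaper', 'Yandex']
```

Note that pruning this way makes the list broader, not narrower: keeping only Reaper also bans anything else with "reaper" in its UA, so check that is what you want.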

CAG CheechDogg 12-11-2014 06:48 PM

Quote:

Originally Posted by Simon Lloyd (Post 2526789)
There are one or two duplicates there, but that doesn't matter. However, rather than ban baiduspider, just ban baidu; you'll have more luck with that, as not all Baidu spiders have the entire name in the UA. That goes for most of the bots. Let's say there's a bot called Simon Lloyd Crawl Everything Everywherespider; then any of the following will ban it:
Simon or Lloyd or Crawl or Spider.....etc
The same goes for:
Lloyd Crawl or Everything or Simon Lloyd....etc (case isn't important)

What the mod does is look for the string you entered, so if you want to ban the spider I mentioned above, just Simon will do it. However, let's say you have a friendly bot called Simon Lloyd Crawled Everything Everywherespider; then to ban the first bot and allow the other you'd need to enter a string that is unique to the first one, so in this case it could be:
Simon Lloyd Crawl Everything
This way it won't pick up the "Crawl" in the friendly bot's name, as it's looking for the exact string you entered.

Hope that helps.

No no ...I am fine Simon, I have "ZERO" traces of baidu ... I think when you asked for the settings and a shot you were asking Gadget Guy and not me ... but I am fine, I have no problems with the mod not blocking any of the bots at all ... Thank you !!!!

Gadget_Guy 12-11-2014 07:35 PM

Do you mean this one:

https://vborg.vbsupport.ru/showthread.php?t=232636


Then, yes... I am using it.

D.

Gadget_Guy 12-11-2014 07:43 PM

Quote:

Originally Posted by Simon Lloyd (Post 2526796)
Right, I've been through your list. I won't comment on the bots you are banning, as that's your preference. What I have done is order the list, check for anything that shouldn't be there, and remove some bots, as they will be taken care of by other entries you have.

What I will say is: if you are NOT using Paul M's "Who has visited" mod and you are still seeing any of the bots on your list appear in WOL, then you need to check that spider's UserAgent to see if the name or text you have in your list actually appears in the UA.

I have implemented your list.

In regards to your comment about "my list".... I have no idea to be honest.

Those I put in there based on what I was seeing in my WOL and putting things in there to try and block them.

I am sure I was way off base and incorrect in doing so.

So.... in light of this... if you want to provide a "proper" list... I am happy to take your guidance.

I don't know the first thing about any of this stuff and am looking to experts like yourself to assist.

D.

CAG CheechDogg 12-11-2014 07:51 PM

1 Attachment(s)
Quote:

Originally Posted by Gadget_Guy (Post 2526807)
I have implemented your list.

In regards to your comment about "my list".... I have no idea to be honest.

Those I put in there based on what I was seeing in my WOL and putting things in there to try and block them.

I am sure I was way off base and incorrect in doing so.

So.... in light of this... if you want to provide a "proper" list... I am happy to take your guidance.

I don't know the first thing about any of this stuff and am looking to experts like yourself to assist.

D.

Gadge my Man... what you need to do is ask yourself a few questions ... like how much do rankings mean to you, if you want every single search engine to crawl your site ... are you on a shared or dedicated server and if you are already having problems with bots and especially bad bots hitting your site too much ..

I for one don't care for any other search engine except for google and yahoo ... I have completely blocked most search engines that are foreign and facebook as well ... as you can see from the screenshot I only have like 10 different spiders/bots that even crawl my site and that is just how I want it ...

So put together a list of the engines and spiders that you know are giving you hell then you can add those to the list that I have or Simon and Ozzy have and you can give it one last look and just remove the ones you want your site to be crawled with ....

I added my list in a txt file if you want to try mine out and see how it works for you ...

Here it is as well ...

Simon Lloyd 12-11-2014 07:53 PM

As I've said many times in this thread, the fact that they are showing up in Paul M's mod doesn't mean they are getting through. His mod logs them as they visit and mine redirects them at the same time; both mods are working fine!

Just copy CAG CheechDogg's list and prune as needed. There is no "proper" list; it's all a personal choice!

CAG CheechDogg 12-11-2014 07:57 PM

Quote:

Originally Posted by Simon Lloyd (Post 2526813)
As I've said many times in this thread, the fact that they are showing up in Paul M's mod doesn't mean they are getting through. His mod logs them as they visit and mine redirects them at the same time; both mods are working fine!

Just copy CAG CheechDogg's list and prune as needed. There is no "proper" list; it's all a personal choice!

Right Simon ... it logs the actual visit (detection) ... I check to see what is actually getting through by going to the online.php page and selecting Display: Search Bots from the drop down and then you can see what is actually crawling the site ...

CAG CheechDogg 12-11-2014 08:00 PM

I actually use this mod here by the Great Boobo (RIP) to also display the spiders in the Who's Online list and it is hell of accurate !!!!

https://vborg.vbsupport.ru/showthread.php?t=243460

Gadget_Guy 12-11-2014 08:20 PM

Guys... I can't thank you enough for all the help and advice.

I feel like a huge blindfold was lifted from my eyes with the last couple posts.

I tried reading all 45 pages of this mod to really understand it... but I missed the point about the detections being just that.

I was scratching my head when WOL didn't really match up with online.php list

I want to be crawled... but only by the right spiders..... the ones that really count for a North American audience and people who use the "traditional" engines like Yahoo, Google, Bing etc.

I certainly don't want to jeopardize my Google adsense ads and things like that.

I want my site to be found.... and be a source when people search for information pertaining to what we do.

D.

ozzy47 12-11-2014 08:26 PM

I took a quick look at your list, it looks like you are only blocking bad bots, so it should be ok. :)

Max Taxable 12-11-2014 10:19 PM

On my own vB 4, the only time I ever see Baidu in either Paul's mod or in WoL, is when I turn off this mod.

Paul's mod fires before this one, that is true. But that is not the reason some people get Baidu there. Baidu is also in WoL, if you look, on boards that are showing Baidu in Paul's mod.

ozzy47 12-11-2014 10:21 PM

Well I think I might have taken care of the issues, as well as added a few more things to the mod, but I need to test it a bit more to be totally sure.

Gadget_Guy 12-11-2014 10:23 PM

WOL seems to be pretty clean now....

What is this spider? I see a lot of entries for it on WOL

Proximic Spider
54.175.33.76
Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)

Edit...

This one as well:

Magpie Spider
94.228.34.203
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)


D.

Max Taxable 12-11-2014 10:28 PM

Harmless crawlers but if you want them blocked you can put them on the list.

ozzy47 12-11-2014 10:29 PM

Proximic Spider
http://www.proximic.com/spider.html
Magpie Spider
http://www.brandwatch.com/magpie-crawler/

CAG CheechDogg 12-12-2014 02:25 AM

Quote:

Originally Posted by ozzy47 (Post 2526843)

I have had those 2 blocked for a very long time ..no need for them for my forums ....

CAG CheechDogg 12-12-2014 02:30 AM

Quote:

Originally Posted by Max Taxable (Post 2526842)
Harmless crawlers but if you want them blocked you can put them on the list.

It's not so much about if they are harmless or not, it's the amount of resources they sometimes use up crawling and the amount of sessions they leave open and the amount of spiders crawling at the same time ....for those 2 in the past I saw 23 different ips for proximic at one time and like 10 for magpie as well .. way too many sessions and spiders to be running around at the same time....

Gadget_Guy 12-12-2014 02:41 AM

So if I want to add them to my list, what do I enter?

D.

Max Taxable 12-12-2014 02:43 AM

Quote:

Originally Posted by Gadget_Guy (Post 2526866)
So if I want to add them to my list, what do I enter?

D.

Proximic
Magpie

CAG CheechDogg 12-12-2014 02:46 AM

Make sure you leave no trailing spaces or space at the start
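
As a quick sanity check, the two suggested entries can be tested against the UA strings posted earlier in the thread. This Python snippet assumes the mod does a case-insensitive substring match and does not trim entries itself; neither is confirmed here, so treat it as an illustration of why a stray space could matter.

```python
# User-Agent strings exactly as posted above in this thread.
user_agents = {
    "Proximic": "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)",
    "Magpie": "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)",
}

# Each entry is found in its UA, regardless of upper/lower case.
matches = {entry: entry.lower() in ua.lower() for entry, ua in user_agents.items()}
print(matches)  # {'Proximic': True, 'Magpie': True}

# With an accidental trailing space the entry is no longer found, because
# "proximic" is followed by ";" or "." in that UA, never by a space.
with_space = "Proximic "
print(with_space.lower() in user_agents["Proximic"].lower())  # False
```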

Gadget_Guy 12-12-2014 03:14 AM

1 Attachment(s)
Okay.. now I am ready to bang my head on the wall.

Baidu is back.

Just saw it in WOL

(I grabbed two to show)

D.

Simon Lloyd 12-12-2014 04:37 AM

If you see them again please take a snapshot showing their useragents.

Gadget_Guy 12-12-2014 10:37 AM

1 Attachment(s)
Here are snapshots.

I included magpie which I added yesterday.

CAG CheechDogg 12-12-2014 10:48 AM

Quote:

Originally Posted by Gadget_Guy (Post 2526871)
Okay.. now I am ready to bang my head on the wall.

Baidu is back.

Just saw it in WOL

(I grabbed two to show)

D.

Just deny those IPs access to your site with htaccess ... that is what I also did even though I use this mod ...
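
If you go the htaccess route, a small script can build the deny block so every address is validated first. This is only a sketch: it assumes Apache 2.4 (where "Require not ip" inside a RequireAll block is the syntax; Apache 2.2 uses the older Order/Deny directives instead), and the addresses below are reserved documentation placeholders, not real Baidu IPs. The function name htaccess_deny_block is made up for the example.

```python
import ipaddress


def htaccess_deny_block(addresses: list[str]) -> str:
    """Build an Apache 2.4 .htaccess block that denies the given IPs or CIDR ranges."""
    lines = ["<RequireAll>", "    Require all granted"]
    for addr in addresses:
        # ip_network() raises ValueError for anything that is not a valid
        # address or range, so typos are caught before they reach .htaccess.
        network = ipaddress.ip_network(addr, strict=False)
        if network.num_addresses == 1:
            target = str(network.network_address)  # single IP, no /32 suffix
        else:
            target = str(network)                   # range in CIDR form
        lines.append(f"    Require not ip {target}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


# Placeholder addresses (RFC 5737 documentation ranges); use the IPs from your own WOL.
print(htaccess_deny_block(["192.0.2.10", "198.51.100.0/24"]))
```

Bear in mind that crawlers change IPs often, so a list like this needs checking against your logs from time to time.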

