vb.org Archive (https://vborg.vbsupport.ru/index.php)
-   vBulletin 4.x Add-ons (https://vborg.vbsupport.ru/forumdisplay.php?f=245)
-   -   Miscellaneous Hacks - Ban Spiders by User Agent (https://vborg.vbsupport.ru/showthread.php?t=268208)

fly 02-11-2013 05:20 PM

Quote:

Originally Posted by Max Taxable (Post 2403548)
Amazon AWS is their hosting they sell. And yes they also crawl the web: http://aws.amazon.com/search-engines/

I have it blocked as well, using this Mod.

Did you read that? There is nowhere in that link that says that Amazon themselves crawl websites. Can you even explain why a hosting company would want to catalog data from every website on the internet?

I'm wondering if there is some confusion about what a user agent is and does. The UA is the remote web crawler's way of telling you that it is there cataloging your site. A crawler isn't required to send a UA at all; it's just considered polite. If someone wanted to, they could send a completely random UA every time or not send one at all.

Since Amazon AWS is in the hosting business, they have no need to crawl websites at all. However, this doesn't PREVENT people from buying their own server from Amazon and crawling your website. If someone were to do this, the UA would be whatever they wanted it to be, not some form of "AmazonAWS".

Assuming what you're really trying to do is prevent anyone from buying a server from Amazon and accessing your website, you'll need to find all the IP blocks that AWS owns and block those. However, that is outside the scope of this mod.
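For anyone who does want to go the IP route, the check itself is simple; the hard part is collecting the ranges and keeping them current. A minimal Python sketch, assuming you have gathered the provider's CIDR blocks yourself. The ranges below are reserved documentation addresses used as placeholders, not real AWS ranges, and `is_blocked_ip` is a made-up name that has nothing to do with this mod.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real AWS ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked_ip(remote_addr, ranges=BLOCKED_RANGES):
    """Return True if remote_addr falls inside any blocked CIDR range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IP string; leave the decision to other checks.
        return False
    return any(addr in net for net in ranges)
```

In practice this kind of check usually lives in the firewall or web server config rather than in application code, since the request is then rejected before PHP ever runs.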

Simon Lloyd 02-11-2013 07:58 PM

For reference here's what a user agent is and some extra info http://en.wikipedia.org/wiki/User_agent. All this mod is designed to do is stop bots from eating up your bandwidth by redirecting them before any content loads. To be honest you can never stop anyone who is intent on scraping your site from doing so.
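To make the idea concrete, here is a rough Python sketch of "check the user agent, redirect before any content loads". It is not the mod's actual code (the mod is a vBulletin product); the function names and the case-insensitive comparison are assumptions made for illustration.

```python
def find_banned_token(user_agent, banned_tokens):
    """Return the first banned token contained in user_agent, else None."""
    ua = (user_agent or "").lower()
    for token in banned_tokens:
        # Assumption: a plain, case-insensitive "contains" test.
        if token and token.lower() in ua:
            return token
    return None

def respond(user_agent, banned_tokens, redirect_url):
    """Decide the response before any page content is generated."""
    if find_banned_token(user_agent, banned_tokens) is not None:
        # 301 = permanent redirect; no forum page is rendered for this request.
        return 301, {"Location": redirect_url}
    return 200, {}

BANNED = ["Baidu", "Yandex", "HTTrack"]
print(respond("Mozilla/5.0 (compatible; Baiduspider/2.0)", BANNED, "http://example.com/"))
# prints (301, {'Location': 'http://example.com/'})
```

Because the decision depends only on a header the client chooses to send, a scraper that fakes a normal browser UA passes straight through, which is the limitation described above.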

Max Taxable 02-11-2013 11:00 PM

Quote:

Originally Posted by fly (Post 2403551)
Did you read that? There is nowhere in that link that says that Amazon themselves crawl websites. Can you even explain why a hosting company would want to catalog data from every website on the internet?

I'm wondering if there is some confusion about what a user agent is and does. The UA is the remote web crawler's way of telling you that it is there cataloging your site. A crawler isn't required to send a UA at all; it's just considered polite. If someone wanted to, they could send a completely random UA every time or not send one at all.

Since Amazon AWS is in the hosting business, they have no need to crawl websites at all. However, this doesn't PREVENT people from buying their own server from Amazon and crawling your website. If someone were to do this, the UA would be whatever they wanted it to be, not some form of "AmazonAWS".

Assuming what you're really trying to do is prevent anyone from buying a server from Amazon and accessing your website, you'll need to find all the IP blocks that AWS owns and block those. However, that is outside the scope of this mod.

The "amazonaws" crawlers have that designation in their UA string. Anything else coming from Amazon has it in its host description.

The rest of your missive, I am well aware of.

fly 02-12-2013 12:55 AM

ok.

Inspector G 03-03-2013 01:39 AM

I have a confusing question...
OK, I have a very small member site, like 24 members.
So when I noticed I had 35 users online most of the time and started seeing more and more Baidu spiders,
I decided to do something about it...

I installed this mod.
Almost instantly, well, within say 3 hours, my users online soared to well over 150 at busy times like now, tonight.
I had:
Most users ever online was 247, 1 Day Ago at 12:58 AM.

With only one new account created, and maybe me or one other registered user online...

My question is this: what happened when I installed this mod to make such a drastic change in the users on my site, and why?

I do not understand this and I read that the server load increases...
I find it hard to believe that anyone is finding my site via a search engine since it is a brand new .cc name and it has only been online for two months now...

Is there something about pushing away Baidu that enables more sites to come, or spam bots
attempting to register and whatnot? Many are in areas where there would not be a normal user.

I see many attempts at registering and yet no more new users... so I believe those are bots locking...

Please advise...

Simon Lloyd 03-03-2013 03:42 AM

What's happening (and you'll probably find this) is that because Baidu can't get in with the spiders/IPs they were using, they are now trying a rotation of other IPs and bots. I use this mod myself, although I don't ban the bots, as I monitor their visits to further enhance any mod I make against them. I currently have 236 Baidu bots (and 140 other bots/search engines) at my site.

With the mod in place and redirection working, you'll find that the bots you have banned will slowly drop off as they all get the message of the 301 permanent redirect to wherever you've decided to send them. Your server load will lessen and things will be more normal :)

Simon Lloyd 03-03-2013 03:44 AM

Also, do you have your robots.txt set up correctly to stop the search engines and bots that obey robots.txt from indexing pages on your site that they shouldn't, like register.php, members.php, etc.?
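If you want to sanity-check a robots.txt before uploading it, Python's standard `urllib.robotparser` can read the rules and tell you what a well-behaved bot would be allowed to fetch. A small sketch: the rules below are an illustrative sample built from the two pages named above (not the contents of any attachment in this thread), and `ExampleBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; adjust paths to match your own forum.
ROBOTS_TXT = """\
User-agent: *
Disallow: /register.php
Disallow: /members.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """True if a bot that obeys robots.txt may fetch this path."""
    return parser.can_fetch(agent, path)

print(allowed("/register.php"))
print(allowed("/showthread.php"))
```

This only tells you what polite bots will do. Bots that ignore robots.txt are the ones the mod is for.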

Inspector G 03-03-2013 03:59 AM

I did not understand how to do the text part since I am what even I would call very green in this aspect of vBulletin...
so I just installed the mod...
I can wait and see if it drops off and report back...
Thanks for the help in understanding...

Simon Lloyd 03-03-2013 05:34 AM

1 Attachment(s)
OK, what you need to do is upload the attached file to your site root. If your forum is at this level, www.mysite.com/, edit the attached file to remove /forums first. If your forum is at this level, www.mysite.com/forums, you can upload it unchanged, but still put it in the site root (www.mysite.com/robots.txt), because crawlers only look for robots.txt there.

You can add any page or file to robots.txt that you wish, just follow the same structure :)

Inspector G 03-03-2013 05:56 AM

Well thanks Simon...
That's really nice...
I will do so immediately.
Nice to see someone really help out the Noob...lol
Thanks again I appreciate this very much...
I will report back.

Inspector G 03-03-2013 05:59 AM

So I think what you are telling me is this...
Since my forum is at root level, I should edit as follows:
This: Disallow: /forums/albums.php
to this: Disallow: /albums.php

Simon Lloyd 03-03-2013 07:40 AM

Yes, if your forum isn't in a folder but simply "on your server", so you don't need to access a folder to get to it, then that's correct!
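Doing that edit by hand is fine for a short file. For a longer one, the same change can be scripted. A minimal Python sketch, assuming every rule that needs changing starts with the folder prefix; `strip_prefix_from_robots` is a made-up helper name, not something shipped with the attachment.

```python
def strip_prefix_from_robots(text, prefix="/forums"):
    """Remove a leading folder prefix from every Allow/Disallow path."""
    out = []
    for line in text.splitlines():
        field, sep, value = line.partition(":")
        path = value.strip()
        is_rule = field.strip().lower() in ("allow", "disallow")
        if sep and is_rule and path.startswith(prefix + "/"):
            # Keep the field name, drop only the folder prefix from the path.
            line = f"{field}: {path[len(prefix):]}"
        out.append(line)
    return "\n".join(out)

print(strip_prefix_from_robots("User-agent: *\nDisallow: /forums/albums.php"))
# prints:
# User-agent: *
# Disallow: /albums.php
```

Lines that are not Allow/Disallow rules (User-agent, Sitemap, comments) pass through untouched.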

dog-tag 03-31-2013 04:37 PM

After being installed for only 10 minutes, I've seen a 20% drop in server load already. I was already blocking them with .htaccess but they were still getting in. According to AWStats, bots have been hitting my server MILLIONS of times per month.

Thank you very much from the bottom of my heart, you're very talented!

Simon Lloyd 03-31-2013 05:44 PM

You're welcome. Don't forget to remove them from .htaccess now, as they will be adding load just being there :)

datoneer 04-01-2013 08:21 PM

Thank you good mod

Simon Lloyd 04-01-2013 08:54 PM

Glad you like it :)

bzcomputers 05-01-2013 09:29 PM

Been running this for a little over 8 months now.

This past month it blocked 6,659 bad bots, which is very close to what it blocked in the first month I had it installed.

Baidu finally stopped coming after about 4 months. They were originally hitting the site at over 10 times an hour. Yandex is still coming but they are down to once or twice a day instead of multiple times an hour.

Most Popular blocked User Agents currently:
FunWebProducts, MSIE 6, MSIE 7, Nutch, Yandex

My Full Blocked User Agent list:
Code:

almaden
Anarchie
Artabus
ASPSeek
attach
autoemailspider
BackWeb
Baidu
Bandit
BatchFTP
BlackWidow
BoardReader
Bot\mailto:craftbot@yahoo.com
Buddy
bumblebee
CherryPicker
ChinaClaw
CICC
Collector
CoolWebSearch
Copier
Copyscape
Crescent
DIIbot
DISCo
DISCo\Pump
dotbot
Download\Demon
Download\Wonder
Downloader
Drip
DSurf15a
eCatch
EasyDL/2.99
EirGrabber
email
EmailCollector
EmailSiphon
EmailWolf
Express\WebPictures
ExtractorPro
EyeNetIE
FileHound
FlashGet
FrontPage
FunWebProducts
GetRight
GetSmart
GetWeb!
gigabaz
GNIP
Go\!Zilla
Go!Zilla
Go-Ahead-Got-It
gotit
Grabber
GrabNet
Grafula
grub-client
HMView
HTTrack
httpdown
.*httrack.*
ia_archiver
Ichiro
Image\Stripper
Image\Sucker
Indy*Library
Indy\Library
InterGET
InternetLinkagent
Internet\Ninja
InternetSeer.com
Iria
JBH*agent
JetCar
JOC\Web\Spider
JustView
larbin
LeechFTP
LexiBot
lftp
Link*Sleuth
likse
//Link
LinkWalker
Mag-Net
Magnet
Magpie
Mass\Downloader
Memo
Microsoft.URL
MIDown\tool
Mirror
Mister\PiX
Mozilla.*Indy
Mozilla.*NEWT
Mozilla*MSIECrawler
MS\FrontPage*
MSFrontPage
MSIECrawler
MSIE 2
MSIE 3
MSIE 4
MSIE 5
MSIE 6
MSIE 7
MSProxy
Navroad
NearSite
NetAnts
NetMechanic
NetSpider
Net\Vampire
NetZIP
NICErsPRO
Ninja
Nutch
Octopus
Offline\Explorer
Offline\Navigator
omgili
Openfind
Opera/1
Opera/2
Opera/3
Opera/4
Opera/5
Opera/6
Opera/7
Opera/8
PageGrabber
Papa\Foto
pavuk
pcBrowser
Ping
PingALink
Pockey
psbot
Pump
QRVA
RealDownload
Reaper
Recorder
ReGet
Scooter
Seeker
Siphon
sitecheck.internetseer.com
SiteSnagger
SlySearch
SmartDownload
Snake
sogou
Soso
SpaceBison
speedy
Spinn3r
sproose
Stripper
Sucker
SuperBot
SuperHTTP
Surfbot
Szukacz
tAkeOut
Teleport\Pro
URLSpiderPro
Vacuum
VoidEYE
Web\Image\Collector
Web\Sucker
WebAuto
[Ww]eb[Bb]andit
webcollage
WebCopier
Web\Downloader
WebEMailExtrac.*
WebFetch
WebGo\IS
WebHook
WebLeacher
WebMiner
WebMirror
WebReaper
WebSauger
Website
Website\eXtractor
Website\Quester
Webster
WebStripper
WebWhacker
WebZIP
Wget
Whacker
Widow
WWWOFFLE
x-Tractor
Xaldon\WebSpider
Xenu
Yandex
Yeti
YOUDAOBOT
Zeus.*Webster
Zeus


This new one just showed up and has been attempting to ping my site on average around a hundred times a day (started about 15 days ago):
Code:

05-01-2013 16:20:25 .
Matched bots[135]: . Ping .
With User Agent:  . A6-INDEXER/1.0 (HTTP://WWW.A6CORP.COM/A6-WEB-SCRAPING-POLICY/) .


Seems some bots come and go, just glad this mod is here!
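A note on that log entry: the matched entry is "Ping", and the only place those letters appear in the A6-INDEXER user agent is inside "SCRAPING". That suggests the match is a case-insensitive "contains" test, though that is an inference from this one log line, not something confirmed from the mod's code. The Python sketch below reproduces that behaviour and, for comparison, a stricter whole-word test. The stricter test is purely hypothetical and not a feature of the mod.

```python
import re

UA = "A6-INDEXER/1.0 (HTTP://WWW.A6CORP.COM/A6-WEB-SCRAPING-POLICY/)"

def contains_match(token, user_agent):
    """Case-insensitive substring test (assumed behaviour)."""
    return token.lower() in user_agent.lower()

def whole_word_match(token, user_agent):
    """Hypothetical stricter test: token must not be glued to other letters."""
    pattern = r"(?<![A-Za-z])" + re.escape(token) + r"(?![A-Za-z])"
    return re.search(pattern, user_agent, re.IGNORECASE) is not None

print(contains_match("Ping", UA))    # True: "PING" sits inside "SCRAPING"
print(whole_word_match("Ping", UA))  # False
```

Here the accidental match happened to catch a scraper, but the same effect means short entries can block agents they were never aimed at.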

Simon Lloyd 05-01-2013 10:29 PM

I'm very glad you've found this useful. Thanks for posting your updated bot list; it may help others decide which to block. However, I still have to mention that banning bots is a personal thing: you have to decide what you want to achieve from the banning, and whether anything you block will prevent legitimate people from viewing your site.

In the above you block MSIE 7. Whilst this may be good for you, others may want users who still only have IE7 to be able to view their site. All I'm saying to people is think before you block :)
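A related housekeeping check for long lists like the one posted above: if matching is a case-insensitive "contains" test (an assumption, not confirmed from the mod's code), then any entry that contains a shorter entry never changes the outcome. In that list, "email" already covers "EmailCollector", "EmailSiphon" and "EmailWolf", "Ninja" covers "Internet\Ninja", and "Zeus" covers "Zeus.*Webster". A small Python sketch to find such entries; `redundant_entries` is a made-up helper.

```python
def redundant_entries(entries):
    """Map each entry to a shorter entry that already covers it."""
    lowered = [e.lower() for e in entries]
    redundant = {}
    for i, entry in enumerate(lowered):
        for j, other in enumerate(lowered):
            # "other" covers "entry" if it is a strictly shorter substring of it.
            if i != j and other != entry and other in entry:
                redundant[entries[i]] = entries[j]
                break
    return redundant

sample = ["email", "EmailCollector", "Siphon", "EmailSiphon", "Zeus", "Wget"]
print(redundant_entries(sample))
# prints {'EmailCollector': 'email', 'EmailSiphon': 'email'}
```

The flip side is that the short entries doing the covering are also the ones most likely to catch something you did not intend, so they deserve the closest look.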

bzcomputers 05-01-2013 11:12 PM

What is your take on "MSIE 6"? I seem to also be getting quite a few hits from that browser as well.

Simon Lloyd 05-01-2013 11:33 PM

Personally, unless you're catering for developing countries (computer-wise, I mean, like the Eastern Bloc, etc.), I'd ban MSIE 6, but again I have to stress it's a personal choice.

ikorolis 05-08-2013 10:50 AM

thanks
installed your mod

fxdigi-cash 06-07-2013 08:10 AM

Thanks a lot for the great mod!

I will try it out and see how things go...

Cheers

Max Taxable 06-07-2013 02:30 PM

Quote:

Originally Posted by bzcomputers (Post 2419517)
What is your take on "MSIE 6"? I seem to also be getting quite a few hits from that browser as well.

My personal take is, there aren't very many actual humans using it. It's almost always a botnet zombie computer.

And if it is a human using that dinosaur, I really don't want his/her traffic anyway.

bzcomputers 06-07-2013 08:50 PM

Quote:

Originally Posted by Max Taxable (Post 2426396)
My personal take is, there aren't very many actual humans using it. It's almost always a botnet zombie computer.

And if it is a human using that dinosaur, I really don't want his/her traffic anyway.

I came across this a couple weeks back:

http://www.ie6countdown.com/


It's a worldwide countdown Microsoft is doing tracking Internet Explorer 6 usage. They are tracking the percentage of users worldwide still using ie6.


Excluding China the percentage of users worldwide still using ie6 is much less than 1% and in China it is currently 24%. To me that is just one more reason to block "MSIE 6".

Max Taxable 06-08-2013 09:53 PM

Quote:

Originally Posted by bzcomputers (Post 2426516)
I came across this a couple weeks back:

http://www.ie6countdown.com/


It's a worldwide countdown Microsoft is doing tracking Internet Explorer 6 usage. They are tracking the percentage of users worldwide still using ie6.


Excluding China the percentage of users worldwide still using ie6 is much less than 1% and in China it is currently 24%. To me that is just one more reason to block "MSIE 6".

Yup. Like I said, people who use that dinosaur aren't desirable to have on the site, and definitely aren't worth support for their archaic, garbage browser that should be erased from the web.

XGC Viper XI 09-13-2013 07:23 PM

WARNING: For those who use the vBulletin Mobile Application, this plugin can and will prevent your app from being published if you have its user agent banned. I think it is the MSIE 6 entry.

Solution: When you go to publish your app, just disable this product until the app is published, then enable the product afterwards. If you have this active when trying to publish and you have it posting in a forum, look for the post that targets the API file. Then you will know which banned user agent is locking it down and preventing it from getting your site's information. Don't worry, when you go to publish you will know instantly.

fly 09-13-2013 08:48 PM

Why on Earth would you ban the IE6 user agent?

Max Taxable 09-13-2013 09:36 PM

Quote:

Originally Posted by fly (Post 2445448)
Why on Earth would you ban the IE6 user agent?

Because just about the only use for IE6 anymore is spambot networks, botnet zombie computers.

If some real, actual human is still using IE6 I don't want them on my site.
But, there really aren't.

DemOnstar 09-14-2013 01:35 AM

Installed and testing.

One question: should the pre-filled redirect URL be left intact?

Thanks

ForceHSS 09-14-2013 07:38 AM

Quote:

Originally Posted by DemOnstar (Post 2445476)
Installed and testing.

One question: should the pre-filled redirect URL be left intact?

Thanks

You can change that if you want.

DemOnstar 09-14-2013 01:07 PM

Quote:

Originally Posted by ForceHSS (Post 2445517)
You can change that if you want.

Thank you... I will leave as is for the present...

Simon Lloyd 09-14-2013 08:43 PM

You have the option to redirect to a site (i.e. the one already installed) or directly back to the IP of the banned user agent. It's all about choice really :)

DemOnstar 09-17-2013 04:50 AM

How do I know if this is working?

Haven't seen any evidence so far...What do I look for?

Simon Lloyd 09-17-2013 05:13 AM

It will take over 30 minutes to start to see differences in the WOL (Who's Online) as the spiders get the message, and a bit longer until they stop trying altogether.

The easiest way to see it working is to turn on writing to the log file or, if you dare, have threads made in a forum of your choice. I advise against the latter, as you can get thousands of posts quickly! It's only there for test purposes.
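Once the log file is filling up, a quick tally shows which entries are doing the work. A rough Python sketch, assuming the log looks like the three-line excerpt bzcomputers posted earlier in the thread (a timestamp line, a "Matched bots[n]:" line, then a "With User Agent:" line). The regular expression is a guess based on that one excerpt and `tally_matches` is a made-up name.

```python
import re
from collections import Counter

# Captures the matched entry from lines like: "Matched bots[135]: . Ping ."
MATCH_LINE = re.compile(r"Matched bots\[\d+\]:\s*\.\s*(.+?)\s*\.\s*$")

def tally_matches(log_text):
    """Count how often each banned entry appears in the log."""
    counts = Counter()
    for line in log_text.splitlines():
        m = MATCH_LINE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample_log = """\
05-01-2013 16:20:25 .
Matched bots[135]: . Ping .
With User Agent:  . A6-INDEXER/1.0 (HTTP://WWW.A6CORP.COM/A6-WEB-SCRAPING-POLICY/) .
"""
print(tally_matches(sample_log).most_common())
# prints [('Ping', 1)]
```

If your log format differs from that excerpt, the pattern will need adjusting before the counts mean anything.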

Max Taxable 09-17-2013 01:58 PM

Quote:

Originally Posted by DemOnstar (Post 2446119)
How do I know if this is working?

Haven't seen any evidence so far...What do I look for?

Smoke. Smoke starts coming out of your hard drive. :D

Simon Lloyd 09-17-2013 03:37 PM

Quote:

Originally Posted by Max Taxable (Post 2446219)
Smoke. Smoke starts coming out of your hard drive. :D

That's only if you're a power user :D

K4GAP 11-06-2013 06:24 AM

Should I make any changes to my robots.txt file?
Right now it is blank.

Simon Lloyd 11-06-2013 07:25 AM

Hi Gary, this thread has nothing to do with robots.txt files; the mod bans anything whose user agent contains any string you enter into it.

As a standard you should have something in your robots file, as you've been shown here https://vborg.vbsupport.ru/showthread.php?t=304164, and there are many threads here that contain details of robots.txt.

K4GAP 11-06-2013 08:47 AM

Oh, I guess I need to learn more about user agents and robots.

Thanks

Simon Lloyd 11-06-2013 09:13 AM

Hi Gary, all that you need to know about user agents etc. is in the thread description. Not all bots follow robots.txt, so with this mod you can block those bots completely, and many others. What you need to do is identify your target audience: if you are not catering for China, you'd want to block Chinese traffic, and to sort the bots out you can block the likes of Baidu, Sogou, etc.

I'll try and help you with whatever you need along the way so that you get to keep your bandwidth for more important users :)


All times are GMT. The time now is 03:57 AM.
