Remove Bot SIDs from URL Requests
calorie
09-29-2004, 10:00 PM
Hack 1: vB303_remove_bot_sids_1.txt
Okay so I notice that there are some bots where SIDs are in the requests. One such bot is msnbot, and who knows of the current code behind this bot, but it seems that it treats each different SID as a new link. Here is a quick and dirty hack to prevent this. You need the $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REQUEST_URI'] array elements or their equivalents to use this mini hack. The first step of the hack prevents SIDs in new requests. The second step forces a redirect in order to strip the SIDs from links in the bot memory. There is no need to apply this hack for bots that have google or slurp@inktomi or yahoo! slurp as part of their user agent. Like I said, it is a quick and dirty hack, but it does what I need it to do. If you use this mod, a click of the install button is appreciated.
Hack 2: vB303_remove_bot_sids_2.txt
Do the following to see a list of bots that may appear on the Who's Online list: AdminCP >> vBulletin Options >> Who's Online Options >> Spider Identification Strings & Enable Spider Display & Spider Identification Description
However, according to http://www.vbulletin.com/forum/showthread.php?t=112022, the user agents that don't receive session IDs are hard coded in the sessions.php file. The bots that are hard coded are as follows: google, slurp@inktomi, yahoo! slurp
Thus the bots for the "who's online list" versus the bots in the "remove SID list" are currently not the same. This hack removes the session ids from the list of bots in the vBulletin Options rather than from those that were hard coded in the script.
It may be the case that pages were already crawled by a bot not hard coded in the "remove SID list" so those bots may spider with session ids in the requests. This hack includes an optional step to remove session ids from such bots via redirect.
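The idea behind hack 2, checking the user agent against the spider list from the vBulletin Options instead of a hard-coded list, could be sketched roughly as follows. This is not the contents of vB303_remove_bot_sids_2.txt: the function name is_listed_spider and the sample list are hypothetical, the list is assumed to hold one identifier per line as in the Spider Identification Strings option, and stripos() assumes PHP 5 or later.

```php
<?php
// Hypothetical helper: not part of vBulletin and not the attached hack file.
// $spider_list is assumed to hold one identifier per line, as entered in
// the Spider Identification Strings option.
function is_listed_spider($spider_list, $user_agent)
{
    foreach (preg_split('/\r\n|\r|\n/', $spider_list) as $identifier) {
        $identifier = trim($identifier);
        if ($identifier === '') {
            continue; // skip blank lines
        }
        // Case-insensitive substring match against the user agent.
        if (stripos($user_agent, $identifier) !== false) {
            return true;
        }
    }
    return false;
}

$sample_list = "google\nmsnbot\nyahoo! slurp"; // sample values only
$user_agent  = 'msnbot/1.0 (+http://search.msn.com/msnbot.htm)';

// When this is true, the session hash would be left out of generated URLs.
var_dump(is_listed_spider($sample_list, $user_agent));
```

A check like this only affects links generated from now on; SIDs a bot has already stored still need the optional redirect step.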
Nice hack :) Thank you for sharing it with us!
zajako
10-01-2004, 05:45 PM
So this helps in getting results for MSN Search and some others?
Sorry, I'm just kinda confused.
Zachery
10-01-2004, 08:14 PM
All you need to do is add the useragent and its display name in the vBoptions and it will remove the session :D
calorie
10-02-2004, 06:06 AM
If a bot isn't on a remove SID list, and then it crawls, it gets a SID, and when it comes back to respider, the bot has the SID in the respider request because it was assigned a SID initially. What is suggested may strip SIDs from bot requests that are new, but for respider requests, if the bot was initially assigned a SID, the bot remembers the SID, so the SID is in the respider request. In my situation, I didn't have msnbot on a remove SID list, so after some time, the bot was making quite a lot of respider requests for the same pages but using different SIDs. By the time I realized what was happening, I was out a good chunk of bandwidth, so this is the type of situation where I think this hack is useful.
Zachery
10-02-2004, 12:36 PM
I don't think your logic is correct. I never had msnbot on the spiders list (until after I noticed it on my forums). I checked its full location originally and it did display a session ID, and now, after checking it yesterday (when it was displayed as the MSNBot), it did not have a session ID.
calorie
10-02-2004, 03:49 PM
I'm not talking about how bots show up in the who's online list. The only place in the vB code where I see sessions removed for bots is in the sessions.php file:
// automatically determine whether to put the sessionhash into the URL
if (sizeof($_COOKIE) > 0 OR preg_match("#(google|slurp@inktomi|yahoo! slurp)#si", $_SERVER['HTTP_USER_AGENT']))
{
    // they have at least 1 cookie, so they should be accepting them
    $nosessionhash = 1;
    $shash = $session['sessionhash'] = '';
    $surl = $session['sessionurl'] = '';
    $surlJS = $session['sessionurl_js'] = '';
}
else
{
    $nosessionhash = 0;
    $shash = $session['sessionhash'];
    $surl = $session['sessionurl'] = 's=' . $session['sessionhash'] . '&';
    $surlJS = $session['sessionurl_js'] = 's=' . $session['sessionhash'] . '&';
}
If I would have had msnbot in the preg_match statement initially, msnbot would not have had SIDs in the requests.
Because msnbot was not in the preg_match statement, msnbot had SIDs in all requests until I applied this hack.
Please check this (http://www.vbulletin.com/forum/showthread.php?t=112022) thread. Maybe it explains it better than I can.
BTW, are you checking your raw server access logs? I don't see how putting msnbot in vBoptions removes the SID from requests.
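The point about the preg_match statement can be checked in isolation. The sketch below runs the hard-coded pattern from the sessions.php excerpt against an msnbot user agent, then runs the same pattern with msnbot added. The variable names are made up for illustration, and it assumes the bot sends no cookies, so the sizeof($_COOKIE) part of the condition is false.

```php
<?php
// Pattern copied from the sessions.php excerpt above.
$hardcoded = "#(google|slurp@inktomi|yahoo! slurp)#si";
// Same pattern with msnbot added, for comparison only.
$with_msnbot = "#(google|slurp@inktomi|yahoo! slurp|msnbot)#si";

$user_agent = 'msnbot/1.0 (+http://search.msn.com/msnbot.htm)';

// 0 means no match: the else branch runs and s=<sessionhash>& goes into URLs.
$before = preg_match($hardcoded, $user_agent);
// 1 means a match: the session URL pieces are set to empty strings.
$after = preg_match($with_msnbot, $user_agent);

echo "hard-coded list: $before, with msnbot: $after\n";
// prints "hard-coded list: 0, with msnbot: 1"
```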
Zachery
10-02-2004, 04:35 PM
There is a section in the vBulletin 3 options area that lets you specify which user agents are spiders. Once they are defined as spiders, they no longer ever get a session ID.
AdminCP > vBulletin Options > Who's Online Options > Spider Identification Strings & Spider Identification Description
Enter an unique identifier for each Search Engine spider that you wish to recognize. This should be something unique to the spider's HTTP USER AGENT. Please place one per line. Case is not important and the previous option needs to be enabled for identification to occur
Enter the text that you wish to display for each of the above spiders on Who's Online. You need to place the spiders description on the same line as the spider's identifier above. For example, if you place 'google' as the third spider above, place 'Google' on the third line to the right.
calorie
10-02-2004, 09:04 PM
Please read this (http://www.vbulletin.com/forum/showthread.php?t=112022) vB.com thread. It indicates that the "who's online list" and the "remove SID list" are not the same. That thread is dated from August. Has something in the vB code changed since August? Where in the code are bot SIDs removed from the who's online list?
nexialys
10-02-2004, 09:43 PM
Yes, it would be better if that filter related to the WOL were applied globally to the SID list, so we can really filter what kind of spider can browse the site... I have built that feature in IPB, so I suppose it's easy to do for vB!
Erwin
10-02-2004, 11:25 PM
Someone should integrate the 2. :) It would not be too hard.
calorie
10-03-2004, 03:42 AM
Okay, the second hack in the first post of this thread uses the bots listed in the vBulletin Options.
AlexanderT
10-09-2004, 04:25 PM
calorie thank you! Just by accident I noticed that the two IPs using most of my bandwidth were msnbot and jetbot these days. And the logs revealed that they were constantly browsing my forum with new session strings. A nightmare!
Notice though that for vB303_remove_bot_sids_2 you could probably use the datastore cache, thus saving one costly query.
BamaStangGuy
11-13-2004, 12:26 AM
I'm really confused... what is msnbot's unique identifier, and what do I need to do so they don't get SIDs?
T2DMan
11-19-2004, 10:54 PM
Always apprehensive about adding hacks when it adds additional load to the server (more lines of code). But this one looks to reduce the amount of downloads that the spiders will potentially make. So it should mean less bandwidth and less server load from the bots.
Good hack.
ChuanSE
11-22-2004, 07:05 AM
SO, what is the final conclusion about this all?
Is there a hack or update available?
thx
agiacosa
02-03-2005, 09:05 AM
Has this been resolved?
agiacosa
02-03-2005, 09:20 AM
Instructions say "$zzzz_domain_tld = "http://www.yourdomain.tld"; //////////// CONFIGURE THIS VARIABLE - NO ENDING SLASH *************"
Is it www.mydomain.tld or www.mydomain.com?
calorie
02-10-2005, 06:24 AM
The vB 3.0.6 code in includes/sessions.php still *cough* does a hard remove of SIDs from bot requests.
- as of vB 3.0.3: (google|slurp@inktomi|yahoo! slurp)
- as of vB 3.0.4: (google|slurp@inktomi|yahoo! slurp)
- as of vB 3.0.5: (google|slurp@inktomi|yahoo! slurp)
- as of vB 3.0.6: (google|msnbot|yahoo! slurp)
This means that setting WOL bots via vBoptions does not automatically imply removal of SIDs from every bot request.
Note that WOL settings versus SID removal are two different things, as of the last time I checked (see this (http://www.vbulletin.com/forum/showthread.php?t=112022) thread).
For as much as Zachery is a sweetie, as of vB 3.0.6, WOL bots via vBoptions do not automatically remove SIDs from every bot request.
Both hack1 and hack2 posted should still work for vB 3.0.3 through vB 3.0.6, and while I briefly looked at datastore, hack2 still uses a query.
Also note that, although MSNbot was added in includes/sessions.php as of vB 3.0.6, it will not prevent MSNbot (or any other bot) from making requests with SIDs if said bot has already requested pages using SIDs.
That is where the optional portion of the hacks comes into play! I have modified my optional portion, to be placed at the start of includes/init.php, as shown below. Of course, you could PHP include the code just the same.
Now, you need to realize that the below code is rather 'buttoned down' in that listed bots can only crawl forumdisplay, showthread, printthread, and index, and only certain query string type pieces related to those pages.
I worked my optional portion this way because I have no need for bots to consider, for example, showthread.php?t=xyz&page=a&pp=A different from showthread.php?t=xyz&page=a&pp=B, index.php? different from index.php, etcetera.
In my mind, robots.txt and meta tags options, etcetera, are not quite flexible enough, and do not have a fast enough response. Rather, I choose to 'button down that hatch' so to speak with forced 301s as shown below.
Of course, the below code does not preclude the use of a .htaccess file (your OS willing) so, whatever you do, the way you decide to handle bots is ultimately up to you, your OS willing.
/**********************************************************************/
// are $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REQUEST_URI'] defined on your server?
// if the answer is no, do not apply this hack, as this hack needs those $_SERVER elements
// is your vB forum located at http://www.your-domain.com/index.php on your server?
// if the answer is yes, do not apply this hack, as this hack only works for forums located
// at http://www.your-domain.com/your-forum-dir/index.php
// what is your domain uri - no ending slash
$zzzz_domain_tld = "http://www.YOUR-DOMAIN.COM";
// what are your forum directories - separate with | character - begin slash - no ending slash
$zzzz_forum_dirs = "/forum|/forum/archive";
// what forum pages to allow - separate with | character - no extension as .php is assumed
// note: at max you can allow forumdisplay, showthread, printthread, index - no showpost, etcetera
$zzzz_forum_pages = "forumdisplay|showthread|printthread|index";
// what bots to redirect - separate with | character - bot name must be part of the bot user agent
$zzzz_redirect_bots = "msnbot|gigabot|yahoo|google|jeeves|bot|crawl|seek|wisenut|teoma";
/**********************************************************************/
$zzzz_pages_allowed = "(($zzzz_forum_dirs)/($zzzz_forum_pages)\.php((/|[?])?([a-z]+[=][a-z]+[&])?([tf][=-][0-9]+([&](page)[=][0-9]+)?([-][p][-][0-9]+)?)?(\.html)?)?)";
if (preg_match("#($zzzz_redirect_bots)#si",$_SERVER['HTTP_USER_AGENT'])) {
    if (preg_match("#(s|sessionhash)=[a-z0-9]{32}?&?#si",$_SERVER['REQUEST_URI'])) {
        $zzzz_destination = preg_replace("/(s|sessionhash)=[a-z0-9]{32}?&?/","",$_SERVER['REQUEST_URI']);
        zzzz_doRedirect($zzzz_domain_tld,$zzzz_destination);
    }
    if (eregi("$zzzz_pages_allowed(.*)",$_SERVER['REQUEST_URI'],$zzzz_regs)) {
        if (!empty($zzzz_regs[6])) {
            $zzzz_destination = eregi_replace($zzzz_regs[6],"",$zzzz_regs[1]);
        }
        elseif (!empty($zzzz_regs[12])) {
            $zzzz_destination = $zzzz_regs[1];
        }
        if (!empty($zzzz_regs[6]) || !empty($zzzz_regs[12])) {
            $zzzz_destination = eregi_replace("($zzzz_forum_pages)\.php[?]?$","",$zzzz_destination);
            zzzz_doRedirect($zzzz_domain_tld,$zzzz_destination);
        }
    }
    if (!eregi("(($zzzz_forum_dirs)/?$|$zzzz_pages_allowed)",$_SERVER['REQUEST_URI'])) {
        zzzz_doRedirect($zzzz_domain_tld,"");
    }
    if (eregi("(.*)[?]$",$_SERVER['REQUEST_URI'],$zzzz_regs)) {
        zzzz_doRedirect($zzzz_domain_tld,$zzzz_regs[1]);
    }
}
function zzzz_doRedirect($zzzz_domain_tld,$zzzz_destination) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $zzzz_domain_tld$zzzz_destination");
    exit();
}
cellardoor
02-28-2005, 12:09 AM
I'm confused :ermm:
calorie
02-28-2005, 02:24 AM
Here is just one example where a bot indexed a thread, before vB 3.0.6 was released, and still remembers the SID even though reindexing a vB 3.0.7 board:
207.46.98.56 - - [26/Feb/2005:10:31:06 -0800] "GET /forum/showthread.php?s=6f58fd4a031cc78ad7043cdfa0de3287&t=952 HTTP/1.0" 301 0 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
207.46.98.56 - - [26/Feb/2005:10:34:22 -0800] "GET /forum/showthread.php?t=952 HTTP/1.0" 200 36160 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
Note how the first request has a SID in the request, and note how it is 301 redirected, hence the second request. Do tail -f on your access log and watch.
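To see why the second request comes out clean, the SID-stripping expression from the optional redirect step can be run on its own against a request like the first one in the log. This is only a sketch of that one step: the variable names are illustrative, and the 301 redirect itself is left out.

```php
<?php
// Request URI modeled on the first log line above.
$request_uri = '/forum/showthread.php?s=6f58fd4a031cc78ad7043cdfa0de3287&t=952';

// Same expression the optional redirect step uses to strip the session hash.
$destination = preg_replace("/(s|sessionhash)=[a-z0-9]{32}?&?/", "", $request_uri);

echo $destination, "\n"; // prints /forum/showthread.php?t=952
```

The result is the URI the bot is sent to by the 301, which matches the second log line.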