Quote:
Originally Posted by NLP-er
Some people have already asked me about this. It is written in Java and uses its testing engine, so without an IDE you will not even be able to run it. If someone feels comfortable with Java and is able to run tests and edit the code to change some hardcoded things, then PM me.
About the cache tables: they split short, medium and long translations. Thanks to that it works much faster, because the short and medium data have full indexes and unique constraints, which prevent data duplication. As for how those tables correlate with the 50-60k pages in the sitemap: they hold something like 50k translated pages from my forum.
I made a really basic spider in PHP. Here's a piece of it. If you build on it, you could make it work nicely.
PHP Code:
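$html = file_get_contents('http://www.example.com');
if ($html === false || $html === '') {
    die('Could not fetch the page');
}
$dom = new DOMDocument();
// @ hides the warnings that malformed real-world HTML triggers
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    // escape before printing, since the URL comes from a remote page
    echo htmlspecialchars($url).'<br />';
}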
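
One way to build on this is to wrap the link extraction in a function that works on an HTML string and returns a clean, de-duplicated list, so you can feed the result back into a crawl queue. This is only a rough sketch: the function name extractLinks and the sample markup are made up for illustration, and it assumes PHP's DOM extension is enabled.

```php
<?php
// Hypothetical helper (not part of the snippet above): returns the unique
// href values found in an HTML string, in the order they first appear.
function extractLinks(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    $dom = new DOMDocument();
    // Collect parser warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Only look at anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $url = trim($anchor->getAttribute('href'));
        // Skip empty values, in-page fragments and non-web schemes.
        if ($url === '' || $url[0] === '#' || preg_match('/^(javascript|mailto):/i', $url)) {
            continue;
        }
        $links[] = $url;
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

// Made-up sample page, just to show the call.
$sample = '<html><body>'
    . '<a href="/a">A</a>'
    . '<a href="/a">A again</a>'
    . '<a href="#top">Top</a>'
    . '<a href="http://www.example.com/b">B</a>'
    . '</body></html>';

print_r(extractLinks($sample));
```

Relative links such as /a are returned as-is here. A real spider would still need to resolve them against the page URL and keep track of which pages it has already visited.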