I have some code which, in conjunction with preg_match_all, trawls a site and pulls back the links on each page, ready to trawl those pages in turn and pull back their links, and so on. It's locked down so that it only follows internal links and doesn't "bot around" other people's sites!
I'm not going to post the full code as it could be used inappropriately out of the box, so please just trust me that it does the job!
Basically, this bit of code handles the recursion:
$diffs = array_diff($allsiteurls,$processedpages);
foreach($diffs as $diffid => $diff) {
    if(!in_array($diff,$processedpages)) {
        $var = fread_url($diff);
        ProcessPage($var, $diff);
        $diffs = array_diff($allsiteurls,$processedpages);
        if($debug==1) {
            echo $diff." - ".sizeof($allsiteurls).",".sizeof($processedpages);
        }
    }
}
The echo statement is just for debugging and tells me the size of the two arrays.
$allsiteurls is an array containing all the internal urls found by crawling the domain.
$processedpages is a list of all the pages that we've already analysed.
$diffs is the list of URLs that are in $allsiteurls but not yet in $processedpages (the result of array_diff). It is recalculated after every page is processed.
Each time ProcessPage is called, the URL in question is visited, the links on it are collected and added to $allsiteurls, and the URL we've just cURL'd is added to $processedpages.
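To make the bookkeeping concrete, here is a minimal sketch of the behaviour I'm after, with the fetching stubbed out. The hard-coded $links map and process_page_stub() are made up for illustration and stand in for the real site, fread_url and ProcessPage; unlike the foreach above, it uses a while loop that recomputes the difference on every pass.

```php
<?php
// Hypothetical link map standing in for the real site: page => links found on it.
$links = array(
    '/index.html'      => array('/about.html', '/news/news.html'),
    '/about.html'      => array('/index.html'),
    '/news/news.html'  => array('/news/item1.html'),
    '/news/item1.html' => array('/news/news.html'),
);

$allsiteurls    = array('/index.html'); // every internal URL discovered so far
$processedpages = array();              // every URL already analysed

// Stand-in for fread_url() + ProcessPage(): adds the page's links to
// $allsiteurls and records the page itself in $processedpages.
function process_page_stub($url, array $links, array &$allsiteurls, array &$processedpages)
{
    $found = isset($links[$url]) ? $links[$url] : array();
    $allsiteurls = array_values(array_unique(array_merge($allsiteurls, $found)));
    $processedpages[] = $url;
}

// Recompute the difference on every pass; an empty array is falsy, so the
// loop stops once every discovered URL has been processed.
while ($diffs = array_diff($allsiteurls, $processedpages)) {
    $next = reset($diffs); // first URL that has not been processed yet
    process_page_stub($next, $links, $allsiteurls, $processedpages);
}

echo count($allsiteurls) . "," . count($processedpages) . "\n";
```

With this toy link map both counts come out equal, which is what I expected from the real crawl as well.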
At the end of all this, I'm left with:
sizeof($allsiteurls) = 35
sizeof($processedpages) = 23
The problem seems to be with the pages which are not picked up on the first pass, e.g. http://www.digitalvibe.co.uk/news/news.html.
The links are picked up in $allsiteurls, but they're not processed.
Can anyone help with this? I need to get it to recursively process the array for any differences. It's almost as if the array is being passed by value rather than by reference. I'm using PHP 5, where I understood values to be passed by reference automatically, although as far as I know that applies to objects rather than arrays, so this may be exactly what's happening.
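One way to narrow this down is a stripped-down test, unrelated to the crawler, of whether reassigning an array inside a foreach changes what the loop walks over. The names $queue and $visited are made up for the test. As far as I understand it, a by-value foreach keeps iterating over the array as it was when the loop started, so the reassignment should have no effect on the iteration:

```php
<?php
$queue   = array('a', 'b');
$visited = array();

foreach ($queue as $item) {
    $visited[] = $item;
    // Reassign the variable mid-loop, the same way $diffs is reassigned
    // inside the crawler's foreach.
    $queue = array('a', 'b', 'c', 'd');
}

echo implode(',', $visited) . "\n"; // prints "a,b": 'c' and 'd' are never visited
echo count($queue) . "\n";          // prints "4": the variable itself did change
```

If the crawler behaves the same way, that would explain why URLs added to $allsiteurls after the first pass show up in the count but never get processed.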
Cheers,
Karl