PHP5 foreach with an evolving array ByRef/ByVal

#1

I have some code which, in conjunction with preg_match_all, will trawl a site and pull back the links on it, ready for you to then trawl those pages and pull back the links there, and so on. It’s locked down so it will only look at internal links and not “bot around” other people’s sites!

I’m not going to post the full code as it could be used inappropriately out of the box, so please just trust me that it does the job!

Basically, this bit of code handles the recursion:

[code]$diffs = array_diff($allsiteurls,$processedpages);

foreach($diffs as $diffid => $diff) {
    if(!in_array($diff,$processedpages)) {
        $var = fread_url($diff);
        ProcessPage($var, $diff);
        $diffs = array_diff($allsiteurls,$processedpages);
        if($debug==1) {
            echo $diff." - ".sizeof($allsiteurls).",".sizeof($processedpages);
        }
    }
}
[/code]

The echo statement is just for debugging and tells me the size of the two arrays.

$allsiteurls is an array containing all the internal urls found by crawling the domain.

$processedpages is a list of all the pages that we’ve already analysed.

$diffs is the list of any differences between the two arrays. This is updated after every pass.

Each time ProcessPage is called, the url in question is visited and the links collected, and updated in $allsiteurls. The url we’ve cURL’d is put into $processedpages.
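To make that bookkeeping concrete, here is a rough sketch rather than the real function. It assumes the page HTML has already been fetched, that links are absolute and double-quoted, and that the two arrays are passed in by reference; the name processPageSketch, the regex and the example.com URLs are all made up for illustration.

```php
<?php
// Hypothetical stand-in for ProcessPage(): records same-host links found in $html
// and marks $url as processed. No fetching happens here.
function processPageSketch($html, $url, array &$allsiteurls, array &$processedpages) {
    $host = parse_url($url, PHP_URL_HOST);

    // Grab the value of every double-quoted href attribute.
    preg_match_all('/href="([^"]+)"/i', $html, $matches);

    foreach ($matches[1] as $link) {
        // Keep only absolute links on the same host that have not been seen yet.
        if (parse_url($link, PHP_URL_HOST) === $host && !in_array($link, $allsiteurls)) {
            $allsiteurls[] = $link;
        }
    }

    // Mark this page as done so it drops out of the next array_diff().
    $processedpages[] = $url;
}

$allsiteurls    = array();
$processedpages = array();

$html = '<a href="http://example.com/news.html">News</a> '
      . '<a href="http://other.example.org/">Elsewhere</a>';
processPageSketch($html, 'http://example.com/', $allsiteurls, $processedpages);
```

After the call, only the example.com link has been added to $allsiteurls, and the page itself sits in $processedpages.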

At the end of all this, I’m left with:

sizeof($allsiteurls) = 35
sizeof($processedpages) = 23

The problem seems to be with the pages I have which are not picked up on the first pass, e.g. http://www.digitalvibe.co.uk/news/news.html.

The links are picked up in $allsiteurls, but they’re not processed.

Can anyone help with this? I need to get it to recursively process the array for any differences. It’s almost as if the array is being passed ByVal rather than ByRef, but I’m using PHP5, and my understanding was that values are passed ByRef automatically there, so I didn’t expect that to be the case.

Cheers,
Karl

web design, development & seo by DigitalVibe


#2

I managed to sort it out, but thanks for any advice you lot might have given in time!

FYI, the problem wasn’t with my while statement causing a recursive loop (which I’d removed prior to creating this thread). The problem was in some other code, which is now fixed. Anyway, here’s the working snippet:

[code]$allsiteurls[] = $config_url;
$urlsettings[] = array($config_url,"w",0.5);
ProcessPage($var, $config_url);

$diffs = array_diff($allsiteurls,$processedpages);

while(sizeof($diffs) > 0) {
    foreach($diffs as &$diff) {
        if(!in_array($diff,$processedpages)) {
            $var = fread_url($diff);
            ProcessPage($var, $diff);
            $diffs = array_diff($allsiteurls,$processedpages);
            echo $diff." - ".sizeof($allsiteurls).",".sizeof($processedpages)." (".sizeof($diffs).")\n";
        }
    }
}
[/code]
The first call to ProcessPage gives us the index of the site and the foundation for the rest of the script to work on. I could use a cleaner process, but it’s not really worth the time investment for the little gain.

The rest just keeps on calling the foreach on the newly generated diffs until we’ve run out of pages to check.
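As an illustration of that loop, here is a self-contained sketch that runs without touching the network. The $site array is a made-up link map standing in for fread_url() and ProcessPage(), and, as a variation on the snippet above, $diffs is recomputed once per pass rather than inside the foreach.

```php
<?php
// Hypothetical in-memory "site": each URL maps to the links found on that page.
$site = array(
    '/'       => array('/news', '/about'),
    '/news'   => array('/news/1', '/'),
    '/about'  => array(),
    '/news/1' => array('/news'),
);

$allsiteurls    = array('/');
$processedpages = array();

$diffs = array_diff($allsiteurls, $processedpages);

while (count($diffs) > 0) {
    // foreach walks the list of unprocessed pages as it stood when this pass began.
    foreach ($diffs as $url) {
        // Stand-in for fread_url() + ProcessPage(): collect new links, mark the page done.
        foreach ($site[$url] as $link) {
            if (!in_array($link, $allsiteurls)) {
                $allsiteurls[] = $link;
            }
        }
        $processedpages[] = $url;
    }
    // Recompute once per pass; anything discovered above is picked up next time round.
    $diffs = array_diff($allsiteurls, $processedpages);
}
```

The loop ends when a pass discovers nothing new, at which point every URL in $allsiteurls is also in $processedpages.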

The whole idea of the system is to generate a Google Sitemap and to allow the user to:

a) set their own “changefreq” and “priority” in comment blocks, which overrides the defaults.

b) generate a new sitemap file on the fly by running one simple script.

When the script runs, the sitemap is automatically created on the server in the correct place, so it really can be used by a complete novice.
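For anyone curious what the output step might look like, here is a minimal sketch under a few assumptions: entries have the same shape as $urlsettings above (url, frequency code, priority), the one-letter codes map to changefreq values as shown (the mapping is a guess for illustration), and the file uses the sitemaps.org 0.9 namespace. buildSitemap is a hypothetical helper name, not part of the actual script.

```php
<?php
// Hypothetical settings in the same shape as $urlsettings: array(url, changefreq code, priority).
$urlsettings = array(
    array('http://example.com/', 'w', 0.5),
    array('http://example.com/news.html?a=1&b=2', 'd', 0.8),
);

// Assumed mapping from one-letter codes to sitemap changefreq values.
$freqs = array('d' => 'daily', 'w' => 'weekly', 'm' => 'monthly');

function buildSitemap(array $urlsettings, array $freqs) {
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urlsettings as $entry) {
        list($loc, $code, $priority) = $entry;
        $xml .= "  <url>\n";
        // Escape &, <, > and quotes so the URL is valid inside XML.
        $xml .= '    <loc>' . htmlspecialchars($loc, ENT_QUOTES, 'UTF-8') . "</loc>\n";
        $xml .= '    <changefreq>' . $freqs[$code] . "</changefreq>\n";
        // Format the priority with one decimal place, e.g. 0.5.
        $xml .= '    <priority>' . number_format($priority, 1, '.', '') . "</priority>\n";
        $xml .= "  </url>\n";
    }
    $xml .= "</urlset>\n";
    return $xml;
}

$sitemap = buildSitemap($urlsettings, $freqs);
// file_put_contents('sitemap.xml', $sitemap); // writing to disk is left commented out here
```

Building the string first and writing it in one go keeps the file write as a single step, which is handy if the target path needs checking beforehand.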

It’s something I’m going to offer to my clients as a service, as it really could prove useful to them.

I know Joomla etc have their own sitemap generators which trawl the database, but this comes into play more when used on a flat site.

Here’s me, quite pleased with myself :slight_smile:

Cheers,
Karl



#3

[code]foreach($diffs as $diffid => $diff) {
    if(!in_array($diff,$processedpages)) {
        $diffs = array_diff($allsiteurls,$processedpages);
    }
}
[/code]

There’s your problem.
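A minimal sketch of one reading of this: a by-value foreach keeps walking the array it started with, so assigning a new array to the same variable inside the loop has no effect on the current pass. The variable names below are made up for the demonstration.

```php
<?php
$queue   = array('a', 'b');
$visited = array();

foreach ($queue as $item) {
    $visited[] = $item;
    // Reassigning $queue here does not change the array the loop is already walking.
    $queue = array('a', 'b', 'c');
}
```

When the loop finishes, $visited holds only 'a' and 'b', even though $queue now has three elements. Anything added mid-loop has to be picked up by a later pass.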



#4

The “&$diff” is removed now. I was mixing up code I’d left in from other tests (PHP4) that I’d been doing.

I blame 3:30am syndrome - the !in_array block was stupid of me… :stuck_out_tongue:

New code snippet is therefore:

[code]while(sizeof($diffs) > 0) {
    foreach($diffs as $diff) {
        $var = fread_url($diff);
        ProcessPage($var, $diff);
        $diffs = array_diff($allsiteurls,$processedpages);
    }
}
[/code]

Cheers for the heads-up!

Next step: proper error handling, or “when pages go bad”.

Cheers,
Karl
