Curl upgrades stopped my PHP scripts

software development

#1

My screen-scraping scripts using curl have stopped working. phpinfo shows my current version is 5.5.38, so this isn't the 5.6 upgrade that was announced. I found information on StackOverflow about security upgrades for curl, but I haven't been able to make it work. The top of my script looks like this now:
[php]require_once("…/ChartDirector/lib/phpchartdir.php");
//$site = "CHCW";
$site = $_GET['site'];
$pcode = $_GET['pcode'];

set_time_limit(1090000);

//retrieve flow data from BOR database for site, pcode
//$theurl="http://www.usbr.gov/pn-bin/yak/webarccsv.pl"
//."?station=".$site."&year=2000&month=10&day=1&year=2017&month=5&day=20&pcode=$pcode";
date_default_timezone_set("America/Los_Angeles");
$yr = date("Y");
$month = date("m");
$day = date("d");
$theurl = "http://www.usbr.gov/pn-bin/yak/webarccsv.pl"
."?station=".$site."&year=2000&month=10&day=1&year=$yr&month=$month&day=$day&pcode=$pcode";

#echo $theurl, "<br>";

#new lines
$aPost = array(
'file' => new CURLFile($localFile),
'default_file' => 'html_version.html',
'expiration' => (2*31*24*60*60)
)
#end new lines
$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt($ch, CURLOPT_SAFE_UPLOAD, true);
curl_setopt ($ch, CURLOPT_URL, $theurl);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
#new lines
curl_setopt($ch, CURLOPT_BUFFERSIZE, 128);
curl_setopt($ch, CURLOPT_POSTFIELDS, $aPost);
$contents = curl_exec($ch);
curl_close($ch);

// display file
echo $contents;
[/php]

Before I added the new lines, the echo of $theurl showed the correct results, but the echo of $contents was empty. With the new lines, I am getting a parse error on the $ch = curl_init(); line, so I might be getting close. Where's the problem that I am not seeing?


#2

You seem to be missing a semicolon after the closing parenthesis of the $aPost array.
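For reference, a sketch of that block with the semicolon added. The expiration value here is a guess at the original 2*31*24*60*60 (two months in seconds); the forum formatting appears to have eaten the asterisks. The CURLFile line is omitted since $localFile isn't defined in the posted snippet.

```php
<?php
// Sketch of the $aPost block with the terminating semicolon added.
// 'expiration' is assumed to be 2*31*24*60*60 seconds (roughly two months).
$aPost = array(
    'default_file' => 'html_version.html',
    'expiration'   => (2*31*24*60*60)
);  // <-- this semicolon was missing, which caused the parse error
```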


#3

Thanks. It looks like the missing ; was picked up from the StackOverflow post. Now the script runs without error in a browser, but echo $contents; just returns an empty array, Array ( [0] => ), and no records are captured in the database.
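One way to see why the result comes back empty is to ask cURL directly after the failed call. This is a sketch using curl_error()/curl_errno(); the URL is trimmed to the base script here, and the real query string would be appended as in the scripts above.

```php
<?php
// Sketch: when curl_exec() returns false, curl_error() says why.
// Illustrative URL; substitute the full BOR query string.
$ch = curl_init('http://www.usbr.gov/pn-bin/yak/webarccsv.pl');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$contents = curl_exec($ch);
if ($contents === false) {
    // curl_errno() gives a numeric code, curl_error() a readable message
    echo 'cURL error (' . curl_errno($ch) . '): ' . curl_error($ch);
}
curl_close($ch);
```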


#4

That link on StackOverflow about the security upgrades to curl seems to cover the curl upload side, while I'm trying to figure out why my curl downloads stopped working last month on DH. Was there some other change to DH PHP around this time?

If I just go back to my script:
[php]date_default_timezone_set("America/Los_Angeles");
$yr = date("Y");
$month = date("m");
$day = date("d");
$theurl = "http://www.usbr.gov/pn-bin/yak/webarccsv.pl"
."?station=".$site."&year=2000&month=10&day=1&year=$yr&month=$month&day=$day&pcode=$pcode";

#echo $theurl, "<br>";

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
#curl_setopt($ch, CURLOPT_SAFE_UPLOAD, false);
curl_setopt ($ch, CURLOPT_URL, $theurl);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);

// display file
echo $contents;[/php]

$contents just shows an empty array.


#5

Am I getting misled by my test of echoing $contents?
print_r($contents);
print_r(array_values($contents));

also seem to show that $contents is empty.
[hr]
I tried running my test code in some PHP sandboxes like http://sandbox.onlinephpfunctions.com/ and http://phptester.net/, but on those systems curl has been disabled, so curl_init() immediately causes an error and the tests are useless.


#6

After finding that PHP sandbox sites were useless for testing curl execution because of those sites' security concerns, I found a site with a simple explanation of curl at http://oooff.com/php-scripts/basic-curl-scraping-php/basic-scraping-with-curl.php and copied their example to my site. It ran fine, displaying their url oooff.com with an echo of the scraped page array. So now I am convinced that my site can run curl and that I am not being misled by the echo.

However, when I copied my url, https://www.usbr.gov/pn-bin/yak/webarccsv.pl?station=YRPW&year=2000&month=10&day=1&year=2017&month=02&day=23&pcode=QD into their script, I got nothing back from the echo.

I checked my url again in a browser and found it was still valid, but it returned rather slowly. I checked the site phpinfo and found the maximum execution time set at 250. That seems like enough time, but I also tested with $timeout = 2000 in the script. The scrape array still comes back empty.

The size of the BOR url page gets bigger every day because another line is added for each date. Is there any other size or time limit that DH might have changed last month that I could be hitting? Or that I am hitting because of the increasing page size?

[php]<?php
ini_set('max_execution_time', 300); // 300 seconds = 5 minutes
#$url = "oooff.com";
$url = "http://www.usbr.gov/pn-bin/yak/webarccsv.pl?station=YRPW&year=2000&month=10&day=1&year=2017&month=02&day=23&pcode=QD";
echo $url;
$timeout = 2000;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
?>
[/php]

Edit: I just tried changing the url to start at year 2005 rather than 2000 to shorten the page, and maybe shorten the run time at the BOR site, but I still got nothing back in the scraped_page array.
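One thing worth noting about the snippet above: $timeout is set but never passed to cURL, so it has no effect. A sketch of how the two timeout options would actually be applied (the URL here is an illustrative shortened form of the BOR query):

```php
<?php
// Sketch: CURLOPT_CONNECTTIMEOUT caps only the connection phase;
// CURLOPT_TIMEOUT caps the entire transfer. Illustrative URL.
$url = 'http://www.usbr.gov/pn-bin/yak/webarccsv.pl?station=YRPW&pcode=QD';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);  // seconds to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 300);        // seconds for the whole request
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
```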


#7

I finally got it. I had to change the url string from http to https! Something must have changed at the BOR around the first of the calendar year.

Still, I don't understand why the url always worked as http when I took the echo of my url and copied it into a browser.
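In hindsight, one way to spot this from the script would have been to ask cURL what status code the plain-http URL returns; a redirect to https would show up as a 301 or 302. A sketch:

```php
<?php
// Sketch: a browser follows redirects silently; this shows the raw
// status code cURL sees for the http:// form of the URL.
$ch = curl_init('http://www.usbr.gov/pn-bin/yak/webarccsv.pl');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // e.g. 301/302 for a redirect
curl_close($ch);
echo "HTTP status: $code";
```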


#8

Browsers will automatically follow redirection; with cURL, however, you can turn that on or off:

CURLOPT_FOLLOWLOCATION: TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers as it is sent, unless CURLOPT_MAXREDIRS is set).
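Applied to the script in this thread, that would look something like this (a sketch; the http URL is the one from the earlier posts):

```php
<?php
// Sketch: let cURL follow the http -> https redirect the way a browser
// would, with a cap on the number of hops to avoid redirect loops.
$ch = curl_init('http://www.usbr.gov/pn-bin/yak/webarccsv.pl');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow Location: headers
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // limit redirect depth
$contents = curl_exec($ch);
curl_close($ch);
```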