Using cURL to fetch a web page

software development

#1

I have struggled with writing something to fetch a webpage because of the (entirely sensible) restrictions placed on fetching pages from the Wild Windy Webwoods. So I have written some rambling notes to show how cURL helped me.

They are meant for us simple folk who can just work out that Wednesday follows Tuesday and not you flash guys and gals who know that the cube root of 386749765 is… uhm whatever!

If you write a PHP program that is designed to go and fetch a webpage from the World Wide Web, you soon find out that you are not allowed to, because of a sensible restriction placed on the use of fopen(), simplexml_load_file() and the like.

This restriction is absolutely vital to maintain the integrity of a shared server system, and you would have to be ten bob short of a quid to allow the retrieval of all that untrusted material from the web. So an intermediate step is used: you can use cURL (http://uk.php.net/manual/en/ref.curl.php).
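(A quick way to see this for yourself, and not strictly part of these notes: on most setups the switch that controls it is the allow_url_fopen setting, and you can ask PHP about it like so.)

<?php
// Ask PHP whether fopen() and friends are allowed to fetch URLs here.
if (ini_get('allow_url_fopen')) {
    print "URLs are allowed in fopen() on this server.\n";
} else {
    print "URLs are blocked in fopen() on this server - use cURL instead.\n";
}
?>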

But cURL looks complicated, especially to me, and it seems like just one more thing to learn. All you really need is that webpage fetched, so you only need a small bit of cURL code to do it. The good news is that it has already been written. The bad news is that it is not obvious where to stick it!

So here is an example I have come up with to explain what I do.

This bit of code from the O’Reilly book Learning PHP 5 by David Sklar (Copyright 2004 O’Reilly Media, Inc., 0-596-00560-1) fetches a nice list of items from a Yahoo News RSS feed.

(save as rsayahoo.php)

<?php
// Load the RSS feed and print each item title as an HTML list.
$xml = simplexml_load_file('http://rss.news.yahoo.com/rss/oddlyenough');
print "<ul>\n";
foreach ($xml->channel->item as $item) {
    print "<li>$item->title</li>\n";
}
print "</ul>";
?>

But the simplexml_load_file() call in that script is not allowed out into the wild. So you need an intermediary step, cURL, to go and fetch the page.

So the cURL sample would be:

(save as geturl.php)

<?php
// Fetch the feed and write it out to a local file.
$ch = curl_init("http://rss.news.yahoo.com/rss/oddlyenough");
$fp = fopen("example_htmlpage.html", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);     // write the fetched page to that file
curl_setopt($ch, CURLOPT_HEADER, 0);     // leave the HTTP headers out
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>

(The page is fetched and re-written in the local file example_htmlpage.html, creating and/or overwriting it as necessary.)

Now when you upload this code to your web site folder and run it, you will get just a blank page! But it has done what it should: it has fetched “http://rss.news.yahoo.com/rss/oddlyenough” and re-written it out as a local file in your web site folder.
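(If the blank page bothers you, here is a small variation, just my own extra rather than anything you need, that checks the return value of curl_exec() and prints a confirmation instead of nothing.)

<?php
$ch = curl_init("http://rss.news.yahoo.com/rss/oddlyenough");
$fp = fopen("example_htmlpage.html", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
// curl_exec() returns false on failure, so we can report what happened.
if (curl_exec($ch) === false) {
    print "Fetch failed: " . curl_error($ch);
} else {
    print "Fetched the feed into example_htmlpage.html";
}
curl_close($ch);
fclose($fp);
?>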

Now, “example_htmlpage.html” is a local file and you can fopen() this to your heart’s content.
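(For instance, and this is just an illustration of my own rather than part of the recipe, reading the freshly written local copy back with fopen() is now perfectly allowed.)

<?php
// Read the local copy back line by line - no restriction on local files.
$fp = fopen("example_htmlpage.html", "r");
while (!feof($fp)) {
    print fgets($fp);
}
fclose($fp);
?>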

Going back to the original problem, the rsayahoo.php program: you can upload it to your web site folder (where “example_htmlpage.html” has just been written out) with this change to the simplexml_load_file() line:

(upload as localrsayahoo.php)

<?php
// Parse the local copy of the feed and print each item title as an HTML list.
$xml = simplexml_load_file('example_htmlpage.html');
print "<ul>\n";
foreach ($xml->channel->item as $item) {
    print "<li>$item->title</li>\n";
}
print "</ul>";
?>

This will now print out the items as you wanted it to in the first place.

Although the two programs (the cURL part and the PHP part) have been shown as two separate actions, you could of course combine them into one program.

<?php
// First, fetch the feed to a local file with cURL...
$ch = curl_init("http://rss.news.yahoo.com/rss/oddlyenough");
$fp = fopen("example_homepage.html", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);

// ...then parse the local copy and print the item titles.
$xml = simplexml_load_file('example_homepage.html');
print "<ul>\n";
foreach ($xml->channel->item as $item) {
    print "<li>$item->title</li>\n";
}
print "</ul>";
?>

This is just a basic no-frills example to show the place of cURL in the greater scheme of things. You could start adding fancy PHP bits, and more cURL options from http://uk.php.net/manual/en/ref.curl.php.
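(One such option, and this is an extra of my own rather than part of the recipe above, is CURLOPT_RETURNTRANSFER, which hands the fetched page back as a string so you can skip the temporary file altogether and pass it straight to simplexml_load_string().)

<?php
// Fetch the feed into a string instead of a file.
$ch = curl_init("http://rss.news.yahoo.com/rss/oddlyenough");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // hand the page back as a string
curl_setopt($ch, CURLOPT_HEADER, 0);
$rss = curl_exec($ch);
curl_close($ch);

// Parse the string and print the titles, just like before.
$xml = simplexml_load_string($rss);
print "<ul>\n";
foreach ($xml->channel->item as $item) {
    print "<li>$item->title</li>\n";
}
print "</ul>";
?>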


Norm

Opinions are my own views and are not the views of DreamHost.
Any advice offered by me should be acted upon only at your own risk.


#2

This is a great tutorial for cURL. It definitely should be added to the wiki too.

–Matttail
art.googlies.net - personal website


#3

Norm,

That is great! Thanks for masterfully simplifying that whole process.
–rlparker


#4

[quote]to go and fetch a webpage from the World Wide Web

[/quote]

Hey, cool, but like, isn’t that encouraging, aiding and abetting a form of bandwidth theft? :wink:

For that matter, isn’t “forcing” you to retrieve a :cool: image instead of ;-> text also a form of bandwidth theft?

:open_mouth:



#5

Is showing someone how to use a gun encouraging murder? Knowing how to use tools is an entirely different thing from encouraging people to use them in an illegal or immoral manner.

There are lots of legitimate uses for these tools that don’t presume “bad acts” :slight_smile:

–rlparker


#6

As long as the user has permission from the owner of the site being scraped, bandwidth theft is not taking place. In some cases (like RSS feeds, for example), permission can be implied.

I’m sure you already knew this, “anonymous2”. You certainly seem to have a penchant for stirring up trouble.


Simon Jessey | Keystone Websites
Save $97 on yearly plans with promo code SCJESSEY97


#7

Excellent job Norm. I agree with Matttail it should be on the wiki.
Silk


#8

Thanks to all (x=x-1) for the input. :slight_smile:
I will put it in the Wiki once I have worked out Wikiology. Starting at paragraph 3 perhaps.


Norm

Opinions are my own views and are not the views of DreamHost.
Any advice offered by me should be acted upon only at your own risk.


#9

I noticed in the original article this: “Now, “example_htmlpage.html” is a local file and you can fopen() this to your heart’s content.”

Being very non-technical (and probably of limited understanding), is this saying that, for example, if one were to create an RSS feed that consists solely of, say, articles residing on your own website hosted by DreamHost, then fopen() can be used without needing curl?


#10

It doesn’t matter whether the RSS feed resides on a DreamHost website or not, it matters how the information is accessed.

If the RSS feed is a local file, such as “/home/username/example.com/feed.xml”, fopen() can be used without curl.

If the RSS feed is dynamic and created by a CGI script, such as “http://example.com/feed.php”, then you need to use curl.

The short answer is that if you need an Internet connection to access it, you should use curl to copy the content to a local file, then use fopen() to open the local file.

An analogy is having an assistant (curl) and having to deal with a different department (web site). If you want something from the different department, have your assistant call them and write it down and hand you the paper (local file). If you already have what you want on paper, you don’t need the assistant or any phone calls to the other department.
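To put that answer in code, here is a rough sketch (the file name, feed URL and the is-it-a-URL test are just examples, not anything specific to DreamHost):

<?php
// A rough sketch: decide whether the "assistant" (cURL) is needed.
$feed = "http://example.com/feed.php";    // could just as well be a local path

if (preg_match('#^https?://#', $feed)) {
    // Remote feed: have cURL write it down to a local file first.
    $ch = curl_init($feed);
    $fp = fopen("feed_copy.xml", "w");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    $feed = "feed_copy.xml";              // from here on it is just a local file
}

// Local file: fopen() and simplexml_load_file() work without restriction.
$xml = simplexml_load_file($feed);
?>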

:cool: Perl / MySQL / HTML+CSS