Scraping a site using PHP




If you’ve ever wanted to include content from another site in something you’re working on, this is probably one of the simplest ways to do it in PHP. I’ll show you how to scrape a site for specific content using a simple-to-use PHP library.

Web scraping refers to the act of programmatically parsing content from another site and extracting key information from it, much as a human would if they were asked to ‘Go on Amazon and find the prices for all items that come up when you search for “teddy bears”’. Websites often have APIs that give you access to the raw data, and those are the better option when they exist. If there is no API, then you have to get more creative.

With regard to legality, web scraping is definitely a grey area. You firstly need to make sure you’re not breaking the terms and conditions of the site (although even a breach may not be enough, from a legal standpoint, for the website to make a claim against you). If you are duplicating content and designs, then this is where you start treading on thin ice. Duplicating facts, on the other hand, should be fine.

For the most up to date information, I suggest checking out the Wikipedia entry on web scraping. I’d say you’re safe as long as you’re just harvesting information from a public website and not changing the state of anything you access (for example, a case about software placing automatic bids on eBay went to trial, and eBay won).

Technologies Used:

  • PHP

Demo

None really needed.

Libraries used in this demonstration:

  • PHP Simple HTML DOM Parser

Browser support:

Supported by all browsers, but your server requires PHP.

Step 1: Download and include the library

Go to the PHP Simple HTML DOM Parser website and download the latest version. Once downloaded, extract simple_html_dom.php to your project directory.

Open up your project file. In my case this is ‘scraped.php’, and so far it contains the following lines:

<?php
include("simple_html_dom.php");
?>

Fairly simple, I’m just including the library for use later on.

Step 2: Get and analyse the HTML of the site you want to scrape

Freelancers.net Project Page

The examples on the PHP Simple HTML DOM parser website should be fairly easy to follow, but I’ll include a few examples here. One website I check a lot for work is Freelancers.net, but I don’t really like having to go there every time I want to see if a new job has been posted.

You can sign up for daily emails, but these get annoying, so I’m going to take the first 10 job listings and insert them into my personal Chrome home page instead (if you haven’t made yourself a personal homepage, you should!).

To get the site’s HTML we’re going to use a function built into simple_html_dom called file_get_html:

$html = file_get_html("http://freelancers.net/project");

This just loads the contents and turns it into a simple_html_dom object, allowing us to traverse and search the tree easily using simple_html_dom’s built in functions. Once we have that, we need to take a look at the HTML of the site we want to scrape. Looking at the source code for Freelancers.net, we find this:

<div id="yw0" class="list-view">
<div class="summary"></div>
<div class="items">
<div class="view">
<h4 style="padding-bottom: 3px;margin: 0;"><a href="/project/22756-Expression-Engine-development">Expression Engine development</a></h4>
<p style="padding-bottom: 15px;margin: 0;">
<em><font color="#993300">posted on</font> 16th August 2013</em><br/>
<em><font color="#993300">applications received</font> 2</em><br/>
We currently have a shortage of time for a build that needs to be complete by the start of September. 
The site is generally using one layout with a variety of on page content that admin user can &...
</p>
</div>
<div class="view">
<h4 style="padding-bottom: 3px;margin: 0;"><a href="/project/22753-Online-marketing-PR-advertising-in-sports-travel-leisure">Online marketing/PR/advertising in sports/travel/leisure</a></h4>
<p style="padding-bottom: 15px;margin: 0;">
<em><font color="#993300">posted on</font> 16th August 2013</em><br/>
<em><font color="#993300">applications received</font> 3</em><br/>
Our project will be focused on promoting a campaign/event at the end of September, towards adventure/sports/travel/mainstream lifestyle audiences who are 18-35 and ABC1 M/F. 
We are looking for a s...
</p>
</div>
<!-- more entries... -->
<div class="view">
<h4 style="padding-bottom: 3px;margin: 0;"><a href="/project/22739-Developer-Programmer-Required-">Developer/Programmer Required </a></h4>
<p style="padding-bottom: 15px;margin: 0;">
<em><font color="#993300">posted on</font> 13th August 2013</em><br/>
<em><font color="#993300">applications received</font> 3</em><br/>
We are looking for a passionate, knowledgeable and dedicated web-developer/programmer work on our new project. 
This is an exciting project for all involved with great exposure and expectations. Yo...
</p>
</div>
</div>
<div class="pager">
<div style="text-align:center;"><a href=""><span style="color:grey;">&lt; Previous</span></a>&nbsp;|&nbsp;<a href="/project/index?Projectuk_page=2">Next &gt;</a></div>
</div>
<div class="keys" style="display:none" title="/project"><span>22756</span><span>22753</span><span>22752</span><span>22751</span><span>22750</span><span>22749</span><span>22747</span><span>22743</span><span>22742</span><span>22739</span></div>
</div>

Eurgh…not very well structured if you ask me. But no-one is asking me, so that’s okay, this means good practice for us. In any case, we’re lucky that there’s a regular structure to all the jobs, so this should be very quick.

Step 3: Scrape the information

Using the built in ‘find’ function on simple_html_dom we can get the container fairly easily, then loop over the items in the container, extracting each little bit of information we’re presented with. I’ll store the results in an array and do the presentation later on. I find separating the business logic from the presentation tends to be a good idea (following MVC design patterns here):

$items = $html->find('div.items .view');
$jobs = array();
foreach ($items as $item) {
    $job = array();
    $job["title"] = $item->find('h4')[0]->plaintext;
    $jobs[] = $job;
}

So, here we’ve used the find function to find all divs with the class: items (there’s only one, but it doesn’t have an id unfortunately) then all items with the class:view under that. This successfully returns just our job postings.
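If you want to see what that selector is doing without installing the library, here is a rough sketch using PHP’s built in DOMDocument and DOMXPath instead. This is a swapped-in technique, not what the rest of this article uses, and the sample markup is a trimmed copy of the listing above rather than a live fetch. It assumes the dom extension is enabled.

```php
<?php
// Sample markup trimmed from the listing above; nothing is fetched from the live site.
$sampleHtml = <<<HTML
<div id="yw0" class="list-view">
<div class="items">
<div class="view"><h4><a href="/project/22756-Expression-Engine-development">Expression Engine development</a></h4></div>
<div class="view"><h4><a href="/project/22739-Developer-Programmer-Required-">Developer/Programmer Required </a></h4></div>
</div>
</div>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
$xpath = new DOMXPath($doc);

// Rough XPath equivalent of the CSS selector 'div.items .view'.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' items ')]"
       . "//*[contains(concat(' ', normalize-space(@class), ' '), ' view ')]";

$titles = array();
foreach ($xpath->query($query) as $node) {
    // Each job has exactly one h4, so take the first match.
    $h4 = $node->getElementsByTagName('h4')->item(0);
    $titles[] = trim($h4->textContent);
}

print_r($titles);
```

The XPath is a lot wordier than 'div.items .view', which is a good part of why I reach for simple_html_dom on small jobs like this.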

Then, we initialise an array to store our job information and finally, start looping over our view divs. Looking back at the HTML for the Freelancers.net job listings above, we can see that the job title is stored in an H4 tag. It is the only H4 tag in each listing, so we can just take the first result using the [0] array notation, and the plaintext property of that result gives us the job listing title.

Now, we probably want some more information, finding the link is easy enough:

foreach ($items as $item) {
    $job = array();
    $job["title"] = $item->find('h4')[0]->plaintext;
    $job["link"] = "http://freelancers.net" . $item->find('h4')[0]->find('a')[0]->href;
    $jobs[] = $job;
}

Notice I had to include the base url, as all urls are relative as they appear in the HTML. Now, getting the number of applicants, the submission date and the description will be a bit trickier as the information isn’t actually split into different DOM elements properly…eurgh…
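Plain concatenation works here because every href on this page starts with a slash. If you can’t rely on that, a small helper along these lines might be safer. The function name absoluteUrl is my own invention for illustration, and it only handles already-absolute URLs and simple paths, not things like ‘../’ segments or protocol-relative links.

```php
<?php
// Hypothetical helper: turn a scraped href into an absolute URL.
function absoluteUrl($base, $href)
{
    // Already absolute (has a scheme such as http or https): leave it alone.
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (is_string($scheme)) {
        return $href;
    }
    // Root-relative or bare path: join base and path with exactly one slash.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

$relative = absoluteUrl('http://freelancers.net', '/project/22756-Expression-Engine-development');
$absolute = absoluteUrl('http://freelancers.net/', 'http://example.com/elsewhere');

echo $relative, "\n";
echo $absolute, "\n";
```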

PROBLEM: The information we want isn’t presented in a well ordered manner, and there is no single selector that will give us the date, the description and the number of applicants separately.

SOLUTION 1: Just pull the p tag off the page and use the information as it is.

This is as simple as:

$job["details"] = $item->find('p')[0];

However, we then can’t work with, organise and style the data ourselves, so I’m going to go with:

SOLUTION 2: Use relative string lengths to pull out the data we actually need.

To get the date, for example, we will need to take the plaintext result from the first em tag within the p tag and subtract from that the plaintext result from the font tag within the em tag. This might look something like this:

$wholeText = $item->find('p')[0]->find('em')[0]->plaintext;
$unwantedText = $item->find('p')[0]->find('em')[0]->find('font')[0]->plaintext;
// We take a substr of our wholeText to extract the date.
// The +1 is for the space at the beginning of the date string.
$job["date"] = strtotime(substr($wholeText, strlen($unwantedText) + 1));

Not too convoluted, but it sure ain’t pretty. But hey, web scraping tends not to be very pretty anyway, if it gets results it gets results. Notice I’ve also converted it to a date using PHP’s built in strtotime function.
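To make the string arithmetic easier to follow, here is the same idea on plain strings, with no scraping involved. The two values are stand-ins for what the em and font tags would hand back, so treat the exact spacing as an assumption.

```php
<?php
// Stand-ins for the two plaintext values used in the scraper above.
$wholeText    = "posted on 16th August 2013";
$unwantedText = "posted on";

// Skip the label, plus the single space that follows it.
$dateString = substr($wholeText, strlen($unwantedText) + 1);
$timestamp  = strtotime($dateString);

echo $dateString, "\n";               // the label is gone, only the date is left
echo date('Y-m-d', $timestamp), "\n"; // the same date in a normalised form
```

If the site ever changes the label or the spacing, the offset will be wrong, so this is worth re-checking whenever the output starts to look odd.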

Note: I was able to use the simple strtotime function as the dates are presented in an unmistakable way, however, if you have dates in the format 01/02/03, PHP doesn’t know if this is mm/dd/yy or dd/mm/yy or even yy/mm/dd, so a better way to do this might be to use DateTime::createFromFormat($format, $dateString).
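As a quick sketch of why the format matters, here is the same ambiguous string read three different ways with DateTime::createFromFormat. The date string is just the example from the note above, not something taken from Freelancers.net.

```php
<?php
$dateString = '01/02/03';

// The leading "!" resets any fields that are not parsed (such as the time) to zero.
$dayFirst   = DateTime::createFromFormat('!d/m/y', $dateString);
$monthFirst = DateTime::createFromFormat('!m/d/y', $dateString);
$yearFirst  = DateTime::createFromFormat('!y/m/d', $dateString);

echo $dayFirst->format('Y-m-d'), "\n";   // 2003-02-01
echo $monthFirst->format('Y-m-d'), "\n"; // 2003-01-02
echo $yearFirst->format('Y-m-d'), "\n";  // 2001-02-03
```

Three formats, three different dates, which is exactly why you want to state the format yourself when the source is ambiguous.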

Note2: I’ve used the procedural strtotime function here, but you could just as easily use new DateTime($dateString) if you need or prefer the object oriented way. In my case, it is just as simple to use strtotime, but using DateTime is probably preferable in most cases.

Now we can use the same method to get the number of applicants as well as the description:

foreach ($items as $item) {
    $job = array();
    $job["title"] = $item->find('h4')[0]->plaintext;
    $job["link"] = "http://freelancers.net" . $item->find('h4')[0]->find('a')[0]->href;

    $wholeDateText = $item->find('p')[0]->find('em')[0]->plaintext;
    $unwantedDateText = $item->find('p')[0]->find('em')[0]->find('font')[0]->plaintext;
    // Note: The +1 here accounts for the space at the start of the date string.
    $job["date"] = strtotime(substr($wholeDateText, strlen($unwantedDateText) + 1));

    $wholeApplicantsText = $item->find('p')[0]->find('em')[1]->plaintext;
    $unwantedApplicantsText = $item->find('p')[0]->find('em')[1]->find('font')[0]->plaintext;
    // Note: The +1 here accounts for the space at the start of the number of applicants string.
    $job["applications"] = intval(substr($wholeApplicantsText, strlen($unwantedApplicantsText) + 1));

    $wholeText = $item->find('p')[0]->plaintext;
    // Note: The +11 here accounts for all the extra spaces and linebreaks that appear in the plaintext.
    $job["description"] = substr($wholeText, strlen($wholeDateText) + strlen($wholeApplicantsText) + 11);

    $jobs[] = $job;
}

And that’s all the information we have access to on this page.
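If the hard-coded offsets (the +1 and especially the +11) make you nervous, a regular expression is one possible alternative. This is only a sketch and not what I actually run: the sample string is my guess at what the paragraph’s plaintext looks like, and the real whitespace may well differ.

```php
<?php
// Assumed shape of the paragraph's plaintext; the real whitespace may differ.
$wholeText = "posted on 16th August 2013 \n applications received 2 \n We currently have a shortage of time...";

// \s+ soaks up however many spaces and line breaks sit between the pieces.
$pattern = '/posted on\s+(.+?)\s+applications received\s+(\d+)\s*(.*)/s';

$job = array();
if (preg_match($pattern, $wholeText, $m)) {
    $job["date"]         = strtotime($m[1]);
    $job["applications"] = intval($m[2]);
    $job["description"]  = trim($m[3]);
}

print_r($job);
```

The trade-off is that the labels ‘posted on’ and ‘applications received’ are now baked into the pattern, so a wording change on the site breaks it just as surely as a spacing change breaks the offsets.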

Step 4: Use the information

Now we have the information in a nice array, we can use this information wherever we want. Here is what I have on my personal homepage:

Post its generated from the web scraper

And the HTML/PHP used to generate this:

<div id="freelance-jobs">
<?php foreach($jobs as $job):?>
<div class="job">
<h4 class="job-title"><a href="<?= $job['link']?>"><?= $job['title']?></a></h4>
<div class="details">
<span class='applications'>Applications: <?= $job['applications']?> </span>
<span class='date-posted'>Date Posted: <?= date('jS F Y', $job['date'])?></span>
</div>
<div class="description">
<p><?= $job['description']?></p>
</div>
<div class="link">
<a class="see-more" href="<?= $job['link']?>">See more...</a>
</div>
</div>
<?php endforeach;?>
</div> <!-- #freelance-jobs -->
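One thing to bear in mind is that the template above prints whatever the other site gave us straight into our page. For a personal homepage that is probably fine, but if you want to be careful, a small escaping helper is one option. The helper name e and the sample values below are made up for illustration. I’ve passed false for htmlspecialchars’ double_encode argument on the assumption that the scraped text may already contain encoded entities.

```php
<?php
// Hypothetical helper, not part of the template above.
function e($value)
{
    // double_encode = false leaves entities that are already encoded (e.g. &amp;) untouched.
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8', false);
}

// Made-up values standing in for one scraped job.
$job = array(
    'title' => 'Design & build <script>alert(1)</script>',
    'link'  => 'http://freelancers.net/project/1?a=1&amp;b=2',
);

$safeTitle = e($job['title']);
$safeLink  = e($job['link']);

echo '<a href="' . $safeLink . '">' . $safeTitle . '</a>', "\n";
```

In the template you would then write <?= e($job['title'])?> instead of <?= $job['title']?>.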

and the styles (note: this is written using LESS, which if you’re not using already you should be):

#freelance-jobs {
width: 100%;
margin: 64px;
float: left;
font-family: Open Sans, "Open Sans", sans-serif;
.job {
display: block; 
float: left;
width: 240px;
height: 240px;
overflow: hidden;
background: #FFFF66;
padding: 8px 16px;
box-sizing: border-box;
margin: 32px;
.job-title {
margin: 6px 0;
font-size: 0.9em;
}
.details {
font-size: 0.7em;
span {
display: block;
}
}
.description {
font-size: 0.7em;
p {
margin: 6px 0 0;
}
}
.link {
margin: 0;
.see-more {
margin: 0;
font-size: 0.7em;
}
}
}
}

and now I have a nice simple way of seeing if any new jobs have popped up.

Step 5: Extend, extend, extend…

Now we have our basic functionality, there are a few things that we could improve upon, especially if we were using this in a production environment:

  • Cache the data (we don’t want to make too many repeated calls to the site when the site itself only updates once or twice a day)
  • Store the data locally in case of no web connection. This could either be in a database or file based storage. We could even use HTML5 localStorage to accomplish the same thing.
  • Scrape the job posting pages themselves for more information, such as the budget and the full description. This is as simple as making another call to file_get_html on the url we scraped and scraping that page as well.
  • Scrape more than the first page. We could find the next page link in the pager div and scrape as many pages as we want. I personally only wanted the first page.
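As an example of the first two points, here is a minimal file-based cache. It is a sketch under a few assumptions: the function name cachedJobs, the cache file location and the one hour lifetime are all mine, and the closure stands in for the scraping loop from Step 3 so the example runs without touching the network.

```php
<?php
// Hypothetical helper: return cached data if it is fresh, otherwise rebuild it.
function cachedJobs($cacheFile, $ttlSeconds, $producer)
{
    clearstatcache(true, $cacheFile); // make sure we see the file's current state
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $fresh = $producer(); // in the real script this would run the scraping loop
    file_put_contents($cacheFile, json_encode($fresh), LOCK_EX);
    return $fresh;
}

$cacheFile = sys_get_temp_dir() . '/freelance-jobs-demo.json'; // assumed location
@unlink($cacheFile); // start the demo from a clean slate

$calls = 0;
$producer = function () use (&$calls) {
    $calls++; // count how often we would have hit the site
    return array(array('title' => 'Expression Engine development'));
};

$first  = cachedJobs($cacheFile, 3600, $producer);
$second = cachedJobs($cacheFile, 3600, $producer);

echo "Scraper ran $calls time(s)\n";
```

The second call is served from the file, so the other site is only asked once per hour no matter how often the homepage is loaded.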

Conclusion

Web scraping with PHP is super simple, but it does have a few drawbacks. Notably, a lot of data is actually generated through AJAX these days and inserted with JavaScript, which renders this method somewhat defunct. For these pages, you can make calls to the AJAX source itself, or move to a more advanced framework such as Scrapy for Python (bearing in mind that Scrapy does not execute JavaScript on its own either).

You can also run a headless browser such as PhantomJS, which is about the most reliable way to do it, although it is a lot slower as you will have to wait for every asynchronous AJAX call. You can then use something to hook this up to PHP and reap the rewards.

However, for a simple job, PHP Simple HTML DOM parser does a great job and is so quick to use, you just can’t argue with the results.

6 Replies to “Scraping a site using PHP”

  1. I’m new to PHP and this seems really interesting but I’m overwhelmed with it. I have tried to find a basic simple “a, b, c” steps that work off of very simple basics. I’m looking to pull from my own pages and file B’s Div sections be pulled by the main file, File A (template filled by file B). I would like File Bs Divs to be like comments, images and links.

    It seems like this would help a lot but I don’t need to go any further than just the first layer of divs and display those. Any clarification? Thanks.

  2. Hi, I am trying to scrape a website, but curl, file_get_contents and file_get_html all fail. I have even tried proxies, but all in vain. Even a site downloader didn’t work.
