It's been a week at my new job and they've tasked me with writing some simple beginner scripts. This is my first time actually working with PHP beyond writing simple "echo" statements just for the hell of it.
The first task was pretty simple: scrape a university's blog site for its posts and save them to a local database. (Management has never heard of RSS feeds :P )
Simple enough: just Google for a good HTML parser library and read up on its docs. Behold!
http://simplehtmldom.sourceforge.net/
Here's a sample of how it's supposed to be used:
    // Load the Simple HTML DOM library, which provides file_get_html()
    include('simple_html_dom.php');
    // Create DOM from URL or file
    $html = file_get_html('http://www.google.com/');
    // Find all images
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }
    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }
Fantastic, and fits in well with my prior experience in scraping HTML in C# using HtmlAgilityPack. It was a pretty painless transition.
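For comparison, roughly the same extraction can be done without a third-party library, using PHP's bundled DOM extension. This is only a minimal sketch and not part of my scraper: the markup is inlined (an assumption, so it runs without network access), and DOMDocument plus DOMXPath stand in for Simple HTML DOM's find().

```php
<?php
// Assumption: sample markup is inlined so the sketch runs offline.
$markup = <<<HTML
<html><body>
  <img src="logo.png">
  <a href="/news/1.html">First post</a>
  <a href="/news/2.html">Second post</a>
</body></html>
HTML;

// Real-world pages are rarely valid HTML, so collect libxml's
// parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Find all images and collect their src attributes.
$imageSources = array();
foreach ($xpath->query('//img') as $img) {
    $imageSources[] = $img->getAttribute('src');
}

// Find all links and collect their href attributes.
$linkTargets = array();
foreach ($xpath->query('//a') as $a) {
    $linkTargets[] = $a->getAttribute('href');
}

echo implode("\n", $imageSources), "\n";
echo implode("\n", $linkTargets), "\n";
```

The built-in route is more verbose (XPath instead of CSS-style selectors), which is a big part of why a small wrapper library feels so pleasant for quick scraping jobs.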
Soon, I had my POCO objects created full of parsed goodness.
But how would I save this collection of NewsObjects to the database?
Google gave me zilch. The fact is, searching for a MySQL PHP tutorial is like searching through raw sewage. Old, outdated, and plain WRONG information dominates the results.
I then searched for a simple ORM and found RedBean - and it was good!
http://redbeanphp.com/
Installation is dead simple: just include it in your scripts. No settings, no hassles.
<?php
# Including the RedBean library.
include('rb.php');
Next, set up the database connection.
# Prepare the Bean for the data connection magic.
# I have created the database manually in phpMyAdmin with the utf8_unicode_ci collation.
R::setup('mysql:host=localhost;dbname=todolist','root','');
In this simple example, let's save a task to the todolist database. First we create the object the ORM will recognize and be able to use.
# The table "task" has NOT been created at this point. RedBean will do this automatically, I understand.
$task = R::dispense('task');
$task->duedate = "20/04/2013";
$task->title = "Cínco de Máyo";
And save the object to the database. Easy peasy.
# Persist to database.
$id = R::store($task);
echo "<p>Done and done.</p>";
?>
It's that simple!
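To see what the ORM is saving you from, here is a rough sketch of storing the same task by hand with PDO and a prepared statement. It is not what RedBean does internally, just the manual equivalent. I'm assuming an in-memory SQLite database (and the pdo_sqlite extension) in place of the MySQL todolist database so the snippet runs anywhere.

```php
<?php
// Assumption: in-memory SQLite stands in for the MySQL "todolist" database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Without an ORM, the table has to be created by hand.
$pdo->exec(
    'CREATE TABLE task (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        duedate TEXT,
        title TEXT
    )'
);

// Prepared statement: values are bound, never concatenated into the SQL.
$insert = $pdo->prepare(
    'INSERT INTO task (duedate, title) VALUES (:duedate, :title)'
);
$insert->execute(array(
    ':duedate' => '20/04/2013',
    ':title'   => 'Cínco de Máyo',
));
$id = (int) $pdo->lastInsertId();

// Read the row back to confirm it was stored.
$select = $pdo->prepare('SELECT title FROM task WHERE id = ?');
$select->execute(array($id));
$storedTitle = $select->fetchColumn();

echo "Stored task #$id: $storedTitle\n";
```

Compare that with the three RedBean lines above: no CREATE TABLE, no SQL strings, no binding.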
Next time you think about using an ORM for PHP, give RedBean a shot. It's incredibly lightweight and simple. Going back to Entity Framework will be strange after this. :)
Here's the actual scraper script I wrote if you're interested in how it works.
<?php
include('simple_html_dom.php');
include('rb.php');
include('NewsModel.php');

# Setup RedBean to work with a database.
R::setup('mysql:host=localhost;dbname=noticias','root','');
set_time_limit(0);

# Declare variable to hold all parsed news items.
$parsedNews = array();

# Grab page number 1, and parse that first.
$initialPage = file_get_html('http://www.uvm.cl/noticias_mas.shtml');
parse_page_for_news($initialPage, $parsedNews);

# Parse every subsequent page.
$totalPageCount = find_total_page_count($initialPage);
for ($i = 2; $i <= $totalPageCount; $i++) {
    echo "$i<br />\n"; flush();
    $url = "http://www.uvm.cl/noticias_mas.shtml?AA_SL_Session=34499aef1fc7a296fb666dcc7b9d8d05&scrl=1&scr_scr_Go=" . $i;
    $page = file_get_html($url);
    echo "pagina:"; flush();
    parse_page_for_news($page, $parsedNews);
}

# Save each parsed news to the database.
foreach ($parsedNews as $tmpNews) {
    # If this news item already exists, continue in the foreach loop.
    $noticiaExistente = R::findOne('news', ' SourceUrl = ? ', array($tmpNews->get_SourceUrl()));
    if (!empty($noticiaExistente)) {
        continue;
    }

    $noticia = R::dispense('news');
    $noticia->Image = $tmpNews->get_Image();
    $noticia->Date = $tmpNews->get_Date();
    $noticia->Title = $tmpNews->get_Title();
    $noticia->SourceUrl = $tmpNews->get_SourceUrl();
    $noticia->Description = $tmpNews->get_Description();
    $noticia->Content = $tmpNews->get_Content();
    $id = R::store($noticia);
}

# Disconnect from the database.
R::close();

# ------------------------------------- HELPER METHODS -------------------------------------- #
# Function returns the number of pages in the blog.
# Hard-coded to 2 for now; the commented-out line would read the count from the pager links.
function find_total_page_count($page) {
    foreach ($page->find('div.enclose-scroller') as $link) {
        $links = $link->find('a.scroller');
        //return $links[3]->plaintext;
        return 2;
    }
}

# Function receives an HTML DOM object, and the library works against that single HTML object.
function parse_page_for_news($page, &$parsedNews) {
    foreach ($page->find('#cont2 p') as $element) {
        $newItem = new NewsModel();

        # Parse Image.
        foreach ($element->find('img') as $image) {
            $newItem->set_Image($image->src);
        }

        # Parse Date.
        foreach ($element->find('span.fechanoticia') as $fecha) {
            $newItem->set_Date($fecha->innertext);
        }

        # Parse Title.
        foreach ($element->find('a') as $title) {
            $newItem->set_Title($title->innertext);
        }

        # Parse SourceUrl.
        foreach ($element->find('a') as $sourceurl) {
            $newItem->set_SourceUrl("http://www.uvm.cl/" . $sourceurl->href);
        }

        # Parse Description: blank out links, spans and images so only the summary text remains.
        foreach ($element->find('a') as $link) {
            $link->outertext = '';
        }
        foreach ($element->find('span') as $link) {
            $link->outertext = '';
        }
        foreach ($element->find('img') as $link) {
            $link->outertext = '';
        }
        $newItem->set_Description($element->innertext);

        # Parse Content.
        $newsContent = parse_html_body_of_blog_post($newItem->get_SourceUrl());
        $newItem->set_Content($newsContent);

        # Add this new News item to the ParsedNews collection.
        $parsedNews[] = $newItem;
    }
}

# Function that returns the html for each blog post.
function parse_html_body_of_blog_post($urlToBlogPost) {
    $page = file_get_html($urlToBlogPost);
    foreach ($page->find('#cont2') as $element) {
        foreach ($element->find('h2') as $header) {
            $header->outertext = '';
        }
        foreach ($element->find('h3') as $header) {
            $header->outertext = '';
        }
        foreach ($element->find('span.resumen') as $header) {
            $header->outertext = '';
        }
        return $element->outertext;
    }
}
?>
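The script includes NewsModel.php, which isn't shown here. As a rough sketch, a class that satisfies the get_/set_ calls used above could look like the following. Everything in it is a hypothetical reconstruction: only the method names are taken from the scraper script, and the private property names and empty-string defaults are my assumptions.

```php
<?php
// Hypothetical reconstruction of NewsModel.php; only the method names
// are taken from the scraper script above.
class NewsModel
{
    private $image = '';
    private $date = '';
    private $title = '';
    private $sourceUrl = '';
    private $description = '';
    private $content = '';

    public function get_Image() { return $this->image; }
    public function set_Image($value) { $this->image = $value; }

    public function get_Date() { return $this->date; }
    public function set_Date($value) { $this->date = $value; }

    public function get_Title() { return $this->title; }
    public function set_Title($value) { $this->title = $value; }

    public function get_SourceUrl() { return $this->sourceUrl; }
    public function set_SourceUrl($value) { $this->sourceUrl = $value; }

    public function get_Description() { return $this->description; }
    public function set_Description($value) { $this->description = $value; }

    public function get_Content() { return $this->content; }
    public function set_Content($value) { $this->content = $value; }
}

// Usage: fill an item the same way parse_page_for_news() does.
// The title and path below are made-up sample values.
$item = new NewsModel();
$item->set_Title('Sample headline');
$item->set_SourceUrl('http://www.uvm.cl/' . 'noticias/sample.shtml');

echo $item->get_Title(), ' -> ', $item->get_SourceUrl(), "\n";
```

A plain holder class like this keeps the parsing code independent of RedBean; the beans are only created at the very end, when each item is copied over and stored.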