Sergio Tapia - Lost in the GC
-----

A simple, elegant PHP ORM library - RedBeanPHP

It's been a week at my new job and they've tasked me with writing some simple beginner scripts. This is my first time actually working with PHP beyond writing simple "echo" statements just for the hell of it.

The first task was pretty simple: scrape a university's blog site for its posts and save them to a local database. (Management has never heard of RSS feeds :P )

Simple enough: just Google for a good HTML parser library and read up on the docs for it. Behold!

http://simplehtmldom.sourceforge.net/

Here's a sample of how it's supposed to be used:

// Load the library, then create a DOM from a URL or file
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}



Fantastic, and it fits in well with my prior experience scraping HTML in C# with HtmlAgilityPack. It was a pretty painless transition.

Soon, I had my plain old objects (POCOs, as we'd call them in C#) created, full of parsed goodness.
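The class behind those objects (NewsModel.php, included by the scraper script further down) isn't shown in this post. Here is a minimal sketch of what such a plain object might look like. Only the getter and setter names are taken from the scraper script; the private fields and their empty-string defaults are assumptions.

```php
<?php

# Hypothetical sketch of NewsModel.php: a plain data holder with no database logic.
# Only the method names come from the scraper script; everything else is assumed.
class NewsModel {
    private $image = '';
    private $date = '';
    private $title = '';
    private $sourceUrl = '';
    private $description = '';
    private $content = '';

    public function get_Image() { return $this->image; }
    public function set_Image($image) { $this->image = $image; }

    public function get_Date() { return $this->date; }
    public function set_Date($date) { $this->date = $date; }

    public function get_Title() { return $this->title; }
    public function set_Title($title) { $this->title = $title; }

    public function get_SourceUrl() { return $this->sourceUrl; }
    public function set_SourceUrl($sourceUrl) { $this->sourceUrl = $sourceUrl; }

    public function get_Description() { return $this->description; }
    public function set_Description($description) { $this->description = $description; }

    public function get_Content() { return $this->content; }
    public function set_Content($content) { $this->content = $content; }
}

# Usage: fill an item the same way the scraper does.
$item = new NewsModel();
$item->set_Title('Example title');
echo $item->get_Title();
```

Keeping the object free of any database code is what lets the parsing step and the saving step stay separate.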

But how would I save this collection of NewsModel objects to the database?

Google gave me zilch. The fact is, searching for a MySQL PHP tutorial is like searching through raw sewage: old, outdated, plain WRONG information dominates the results.

I then searched for a simple ORM and found RedBean - and it was good!

http://redbeanphp.com/

Installation is dead simple: just include it in your scripts. No settings, no hassles.

<?php

# Including the RedBean library.
include('rb.php');


Next, set up the connection and whatnot.


# Prepare the Bean for the data connection magic.
# I created the database manually in phpMyAdmin with the utf8_unicode_ci collation.
R::setup('mysql:host=localhost;dbname=todolist','root','');



In this simple example, let's save some tasks to a todolist table. First we create the objects the ORM will recognize and be able to use.


# The table "task" has NOT been created at this point. As I understand it, RedBean creates it automatically when the bean is stored.
$task = R::dispense('task');
$task->duedate = "20/04/2013";
$task->title = "Cínco de Máyo";



And save the object to the database. Easy peasy.

# Persist to database.
$id = R::store($task);

echo "<p>Done and done.</p>";

?>


It's that simple!
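One small thing about the task example: the due date is saved as a "20/04/2013" string, which sorts alphabetically rather than by date. Below is a minimal sketch of converting it to ISO format (YYYY-MM-DD) before storing it. The helper name to_iso_date() is made up for this sketch and is not part of RedBean, and the day/month/year input format is an assumption based on the sample value.

```php
<?php

# Convert a day/month/year string into ISO format (YYYY-MM-DD).
# to_iso_date() is a hypothetical helper, not a RedBean function.
function to_iso_date($dayMonthYear) {
    $date = DateTime::createFromFormat('d/m/Y', $dayMonthYear);
    if ($date === false) {
        # The string could not be parsed as day/month/year.
        return null;
    }
    return $date->format('Y-m-d');
}

echo to_iso_date('20/04/2013'); # prints 2013-04-20
```

With that in place, the assignment in the example would read $task->duedate = to_iso_date("20/04/2013");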

Next time you think about using an ORM for PHP, give RedBean a shot. It's incredibly lightweight and simple. Going back to Entity Framework will be strange after this. :)

Here's the actual scraper script I wrote, if you're interested in how it works.


<?php

include('simple_html_dom.php');
include('rb.php');
include('NewsModel.php');

# Setup RedBean to work with a database.
R::setup('mysql:host=localhost;dbname=noticias','root','');

set_time_limit(0);



# Declare variable to hold all parsed news items.
$parsedNews = array();

# Grab page number 1, and parse that first.
$initialPage = file_get_html('http://www.uvm.cl/noticias_mas.shtml');
parse_page_for_news($initialPage, $parsedNews);

# Parse every subsequent page.
$totalPageCount = find_total_page_count($initialPage);
for ($i = 2; $i <= $totalPageCount; $i++) {
    # Print progress as we go.
    echo "page $i<br />\n"; flush();

    $url = "http://www.uvm.cl/noticias_mas.shtml?AA_SL_Session=34499aef1fc7a296fb666dcc7b9d8d05&scrl=1&scr_scr_Go=" . $i;
    $page = file_get_html($url);
    if ($page === false) {
        # Skip pages that could not be fetched.
        continue;
    }

    parse_page_for_news($page, $parsedNews);
}

# Save each parsed news to the database.
foreach($parsedNews as $tmpNews) {

    # If this news item already exists (matched by source URL), skip it.
    $noticiaExistente = R::findOne('news',' SourceUrl = ? ', array($tmpNews->get_SourceUrl()));
    if (!empty($noticiaExistente)) {
        continue;
    }

    $noticia = R::dispense('news');
    $noticia->Image = $tmpNews->get_Image();
    $noticia->Date = $tmpNews->get_Date();
    $noticia->Title = $tmpNews->get_Title();
    $noticia->SourceUrl = $tmpNews->get_SourceUrl();
    $noticia->Description = $tmpNews->get_Description(); 
    $noticia->Content = $tmpNews->get_Content();
    $id = R::store($noticia);
}

# Disconnect from the database.
R::close();





# ------------------------------------- HELPER METHODS -------------------------------------- #



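# Function returns the number of pages in the blog.
function find_total_page_count($page) {
    foreach ($page->find('div.enclose-scroller') as $link) {
        $links = $link->find('a.scroller');
        # Hard-coded to 2 pages while testing; the commented-out line
        # reads the real count from the pager links instead.
        //return $links[3]->plaintext;
        return 2;
    }

    # No pager found: treat the blog as a single page.
    return 1;
}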


# Function receives an HTML DOM object for a single page and parses the news items out of it.
function parse_page_for_news ($page, &$parsedNews) {

    foreach($page->find('#cont2 p') as $element) {
    
        $newItem = new NewsModel();

        # Parse Image.
        foreach ($element->find('img') as $image) {
            $newItem->set_Image($image->src);
        }

        # Parse Date.
        foreach ($element->find('span.fechanoticia') as $fecha) {
            $newItem->set_Date($fecha->innertext);
        }

        # Parse Title.
        foreach ($element->find('a') as $title) {
            $newItem->set_Title($title->innertext);
        }

        # Parse SourceUrl.
        foreach ($element->find('a') as $sourceurl) {
            $newItem->set_SourceUrl("http://www.uvm.cl/" . $sourceurl->href);
        }

        # Parse Description.
        foreach ($element->find('a') as $link) {
            $link->outertext = '';
        }
        foreach ($element->find('span') as $link) {
            $link->outertext = '';
        }
        foreach ($element->find('img') as $link) {
            $link->outertext = '';
        }
        $newItem->set_Description($element->innertext);

        # Parse Content.
        $newsContent = parse_html_body_of_blog_post($newItem->get_SourceUrl());
        $newItem->set_Content($newsContent);


        # Add this new News item to the ParsedNews collection.
        $parsedNews[] = $newItem;

    }
} 

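# Function that returns the HTML for each blog post.
function parse_html_body_of_blog_post ($urlToBlogPost) {
    $page = file_get_html($urlToBlogPost);
    if ($page === false) {
        # The post could not be fetched, so there is no content to return.
        return '';
    }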

    foreach($page->find('#cont2') as $element) {
        foreach($element->find('h2') as $header) {
            $header->outertext = '';
        }

        foreach($element->find('h3') as $header) {
            $header->outertext = '';
        }

        foreach($element->find('span.resumen') as $header) {
            $header->outertext = '';
        }

        return $element->outertext;
    }
}

?>

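The page URL in the loop above is built by gluing the page number onto a long string with a session ID pasted in, and that ID may well expire. A small helper keeps those pieces separate so they're easier to change. This is only a sketch: build_page_url() and its parameter names are hypothetical, and the query parameters are simply the ones visible in the original URL.

```php
<?php

# Build the URL for one page of the news listing.
# build_page_url() is a hypothetical helper; the query keys are copied
# from the URL used in the scraper script.
function build_page_url($baseUrl, $sessionId, $pageNumber) {
    $query = http_build_query(array(
        'AA_SL_Session' => $sessionId,
        'scrl'          => 1,
        'scr_scr_Go'    => (int) $pageNumber,
    ), '', '&');

    return $baseUrl . '?' . $query;
}

echo build_page_url(
    'http://www.uvm.cl/noticias_mas.shtml',
    '34499aef1fc7a296fb666dcc7b9d8d05',
    2
);
```

For page 2 this produces the same URL as the concatenation in the loop, so it could be swapped in without changing what gets fetched.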
