Extract data from HTML file generated by Excel

  • (2 Pages)
  • +
  • 1
  • 2

23 Replies - 9486 Views - Last Post: 25 May 2011 - 01:10 PM Rate Topic: -----

#1 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Extract data from HTML file generated by Excel

Posted 24 May 2011 - 11:18 PM

Hi,

I'm designing a website for an online radio station. One show a week they have a DJ who has his own top 40 show with HIS top 40 listed on his own site

At the moment, the only solution I have to getting the top 40 onto my clients site is an iframe (not good I know!)
However, what's worse is that this top 40 list has horrendous colours that completely break the theme of my clients site and worse still, is Excel generated HTML! The author of this list is not a client of mine, and so, I can't do anything to persuade him to change. I've attached a copy of the offending file so you can see what I'm working with.
Just for clarification, the excel generated HTML file is not hosted on my server (although I could set up a cron job to get it if required)

I've looked at getting a CSV file into a PHP table which I believe can be done with
<?php
echo "<html><body><table>\n\n";
$f = fopen("so-csv.csv", "r");
while (($line = fgetcsv($f)) !== false) {
        echo "<tr>";
        foreach ($line as $cell) {
                echo "<td>" . htmlspecialchars($cell) . "</td>";
        }
        echo "<tr>\n";
}
fclose($f);
echo "\n</table></body></html>";


However, I obviously need a way to get the existing data into CSV format first!

Is there an easy way to strip all the rubbish info from a html page generated by Excel, leaving me with just the data so I can make it fit my clients existing theme?

Many thanks!
Iain

Attached File(s)


This post has been edited by techmonkey: 24 May 2011 - 11:22 PM


Is This A Good Question/Topic? 0
  • +

Replies To: Extract data from HTML file generated by Excel

#2 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 03:03 AM

OK,

I've done some more digging and found a useful bit of code

However, it displays the table 4 or 5 times
In addition to this, is there a way to manipulate the width of the columns with this code?

<?php 
$oldSetting = libxml_use_internal_errors( true ); 
libxml_clear_errors(); 

$html = new DOMDocument(); 
$html->loadHtmlFile('Chart%20Table2.htm'); 

$xpath = new DOMXPath( $html ); 
$elements = $xpath->query( "//table" ); 

foreach ( $elements as $item ) {
  $newDom = new DOMDocument;
  $newDom->appendChild($newDom->importNode($item,true));

  $xpath = new DOMXPath( $newDom ); 

  foreach ($item->attributes as $attribute) { 

    for ($node = $item->firstChild; $node !== NULL; 
         $node = $node->nextSibling) {
      if (($attribute->nodeName =='valign') && ($attribute->nodeValue=='top'))
      {
        print($node->nodeValue); 
      }
      else
      {
        print("<br>".$node->nodeValue);
      }
    }
    print("<br>");
  } 
}

libxml_clear_errors(); 
libxml_use_internal_errors( $oldSetting );



?>

Was This Post Helpful? 0
  • +
  • -

#3 JackOfAllTrades   User is offline

  • Saucy!
  • member icon

Reputation: 6259
  • View blog
  • Posts: 24,028
  • Joined: 23-August 08

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 03:33 AM

Exactly what data are you trying to extract from this file?
Was This Post Helpful? 0
  • +
  • -

#4 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 03:49 AM

Basically, I just want the table,
I'm not bothered about the first column that contains some pictures as I think that's a separate table

Ideally, I'd like to apply my own formatting to the table too (padding etc) if that is possible (eg from line 897
 <td class=xl28 x:num>1</td>


Down to line 1433

The trouble is, there is a lot of custom widths and styles inbetween which I don't want
I'd like all the tr and td's without all the associated guff from the original export from Excel

Geez I wish there was an easy way to get this guy just to export as CSV but unfortunately that's not an option!

Cheers for the response!
Was This Post Helpful? 0
  • +
  • -

#5 codeprada   User is offline

  • Changed Man With Different Priorities
  • member icon

Reputation: 963
  • View blog
  • Posts: 2,382
  • Joined: 15-February 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 03:56 AM

I've looked at the source and extracting the data is going to be very difficult because the TD's used to store the data don't hold a common attribute. On way you can prevent having to extract the data is by removing the styling with DOMDocument which will leave back the raw data.
Was This Post Helpful? 1
  • +
  • -

#6 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 03:59 AM

View Postcodeprada, on 25 May 2011 - 03:56 AM, said:

On way you can prevent having to extract the data is by removing the styling with DOMDocument which will leave back the raw data.

Sorry, but this is a little more advanced than I'm used to! How would I do that and use it to generate a new table?

Many thanks once again!
Iain
Was This Post Helpful? 0
  • +
  • -

#7 JackOfAllTrades   User is offline

  • Saucy!
  • member icon

Reputation: 6259
  • View blog
  • Posts: 24,028
  • Joined: 23-August 08

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 04:08 AM

There is a lot of table data in there. You would have to be more specific.
Was This Post Helpful? 0
  • +
  • -

#8 codeprada   User is offline

  • Changed Man With Different Priorities
  • member icon

Reputation: 963
  • View blog
  • Posts: 2,382
  • Joined: 15-February 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 04:12 AM

You won't have to generate a new table because the table is already generated.

Things like class, style, width, height are all attributes of a a node. What you would need to do is iterate through all the TD nodes and set these attributes to your liking (to fit your site) or remove them.

foreach($dom->getElementsByTagName('td') as &$td)
{
	$td->attributes->getNamedItem('class')->nodeValue = ''; //this clears the class attribute of that node
	//or you can set it to a class of your own
	$td->attributes->getNamedItem('class')->nodeValue = 'myClass';
	$td->attributes->getNamedItem('style')->nodeValue = ''; //clears it
}



Check out the DOMAttr class for more methods.
Was This Post Helpful? 1
  • +
  • -

#9 JackOfAllTrades   User is offline

  • Saucy!
  • member icon

Reputation: 6259
  • View blog
  • Posts: 24,028
  • Joined: 23-August 08

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 04:13 AM

Maybe your best bet might just be to override the CSS to theme it to your liking?
Was This Post Helpful? 1
  • +
  • -

#10 codeprada   User is offline

  • Changed Man With Different Priorities
  • member icon

Reputation: 963
  • View blog
  • Posts: 2,382
  • Joined: 15-February 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 04:19 AM

View PostJackOfAllTrades, on 25 May 2011 - 07:13 AM, said:

Maybe your best bet might just be to override the CSS to theme it to your liking?


Yep that's the plan. Override the CSS and make a class of his own then assign the class attribute of the TD nodes his class.
Was This Post Helpful? 0
  • +
  • -

#11 JackOfAllTrades   User is offline

  • Saucy!
  • member icon

Reputation: 6259
  • View blog
  • Posts: 24,028
  • Joined: 23-August 08

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 05:45 AM

I think this does what you're looking for:

<?php 

class ChartSingle
{
    function __construct($thisWeek, $lastWeek, $change, $artist, $title)
    {
        $this->thisWeek = $thisWeek;
        $this->lastWeek = $lastWeek;
        $this->change = $change;
        $this->artist = $this->quoteIt($this->removeNewlinesAndExtraSpaces($artist));
        $this->title = $this->quoteIt($this->removeNewLinesAndExtraSpaces($title));
    }
 
    function quoteIt($val)
    {
        return "\"$val\"";
    }

    function removeNewlinesAndExtraSpaces($val)
    {
        return preg_replace('#\s\s+#', ' ', str_replace(array("\r", "\n"), "", $val));
    }
    function __toString()
    {
        return implode(",", get_object_vars($this));
    }

    public $thisWeek;
    public $lastWeek;
    public $change;
    public $artist;
    public $title;
}

$oldSetting = libxml_use_internal_errors( true ); 
libxml_clear_errors(); 

$html = new DOMDocument(); 
$html->loadHtmlFile('chart_table.htm'); 

$sx = simplexml_import_dom($html);
$result = $sx->xpath("//table/tbody/tr[@class='xl34']");

foreach ($result as $node) {
    $item = new ChartSingle((int)$node->td[1], 
                            (int)$node->td[2], 
                            (string)$node->td[3], 
                            (string)$node->td[4], 
                            (string)$node->td[5]);
    $list[] = $item;
}

foreach ($list as $single)
  print $single . "\n";

?>


yielding:
php -f techmonkey.php
2,1,-1,"Bruno Mars","The Lazy Song"
11,7,-4,"Lady GaGa","The Edge Of Glory"
12,9,-3,"Jessie J","Nobody's Perfect"
13,8,-5,"Jennifer Lopez Feat. Pitbull","On The Floor"
18,14,-4,"Lady GaGa","Born This Way"
20,15,-5,"Adele","Someone Like You"
23,13,-10,"Katy Perry Feat. Kanye West","E.T."
24,19,-5,"Wretch 32 Feat. Example","Unorthodox"
26,20,-6,"Yasmin","Finish Line"
28,22,-6,"Nero","Guilt"
29,23,-6,"Mann Feat 50 Cent","Buzzin"
33,29,-4,"Cee Lo Green","Bright Lights, Bigger City"
34,30,-4,"Black Eyed Peas","Just Can't Get Enough"
35,24,-11,"Birdy","Skinny Love"
36,32,-4,"Tracy Chapman","Fast Car"
37,35,-2,"Jessie J Feat. B.o.B.","Price Tag"
40,36,-4,"Rihanna Feat. Britney Spears","S & M"

Was This Post Helpful? 0
  • +
  • -

#12 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 05:46 AM

View Postcodeprada, on 25 May 2011 - 04:12 AM, said:

You won't have to generate a new table because the table is already generated.

Things like class, style, width, height are all attributes of a a node. What you would need to do is iterate through all the TD nodes and set these attributes to your liking (to fit your site) or remove them.

foreach($dom->getElementsByTagName('td') as &$td)
{
	$td->attributes->getNamedItem('class')->nodeValue = ''; //this clears the class attribute of that node
	//or you can set it to a class of your own
	$td->attributes->getNamedItem('class')->nodeValue = 'myClass';
	$td->attributes->getNamedItem('style')->nodeValue = ''; //clears it
}



Check out the DOMAttr class for more methods.

Thanks!
Is that to insert into the first code snippet I posted or the second? I tried using it in the second and got

Cannot create references to elements of a temporary array expression

when I tried the following code

<?php  

echo "<html><body><table>\n\n";
$f = fopen("http://www.thesuperstation.info/Chart%20Table2.htm", "r");



while (($line = fgetcsv($f)) !== false) {
        echo "<tr>";
foreach($dom->getElementsByTagName('td') as &$td)
{
	$td->attributes->getNamedItem('class')->nodeValue = ''; //this clears the class attribute of that node
	//or you can set it to a class of your own
	$td->attributes->getNamedItem('class')->nodeValue = 'myClass';
	$td->attributes->getNamedItem('style')->nodeValue = ''; //clears it
}
        foreach ($line as $cell) {
                echo "<td>" . htmlspecialchars($cell) . "</td>";
        }
        echo "<tr>\n";
}
fclose($f);
echo "\n</table></body></html>";

 
  
?> 



I realise I'm asking an awful lot for someone who's only just registered and all your help is greatly appreciated
I've dabbled with PHP before and not too bad generally, just parsing files like this is way beyond me!

Thanks once again!

This post has been edited by techmonkey: 25 May 2011 - 05:47 AM

Was This Post Helpful? 0
  • +
  • -

#13 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 05:56 AM

Jack of all trades,

I've used your snippet and get an invalid argument on line 56
foreach ($list as $single)
  print $single . "\n";



Is this because the html file isn't on my server
Was This Post Helpful? 0
  • +
  • -

#14 codeprada   User is offline

  • Changed Man With Different Priorities
  • member icon

Reputation: 963
  • View blog
  • Posts: 2,382
  • Joined: 15-February 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 06:50 AM

View Posttechmonkey, on 25 May 2011 - 08:46 AM, said:

Cannot create references to elements of a temporary array expression


Replace this line
foreach($dom->getElementsByTagName('td') as &$td)
with
foreach($dom->getElementsByTagName('td') as $td)

Was This Post Helpful? 0
  • +
  • -

#15 techmonkey   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 24-May 11

Re: Extract data from HTML file generated by Excel

Posted 25 May 2011 - 06:59 AM

View Postcodeprada, on 25 May 2011 - 06:50 AM, said:

View Posttechmonkey, on 25 May 2011 - 08:46 AM, said:

Cannot create references to elements of a temporary array expression


Replace this line
foreach($dom->getElementsByTagName('td') as &$td)
with
foreach($dom->getElementsByTagName('td') as $td)


Hi Codeprada

I did that and now get

Call to a member function getElementsByTagName() on a non-object

I used to think I had an OK grounding in PHP but I think I need to get myself a decent book now after this attempt today!

Many thanks
Iain

This post has been edited by techmonkey: 25 May 2011 - 07:08 AM

Was This Post Helpful? 0
  • +
  • -

  • (2 Pages)
  • +
  • 1
  • 2