Extracting raw data from HTML using PHP


A friend I worked with a few jobs ago recently contacted me about doing some freelance work for him. He's collecting data, usually published on government websites, for a project he's working on. It's the kind of work I did in a previous job, and he provided an initial framework for this project. That framework used the PHP library simple_html_dom. As I worked on the project I found that the library was not the best solution and switched over to PHP's built-in DOM libraries. In this post, I'll go over the issues with both simple_html_dom and PHP's built-in libraries.

The decision to use simple_html_dom was made years ago, and because it has largely worked it was never revisited. The library does have its good points. It is relatively easy to learn and the documentation on its website is understandable. If you are dealing with HTML or XML files that are malformed (such as missing closing or opening tags), the library can still work with them. While simple_html_dom allows you to quickly write code to parse HTML documents, there are downsides. A defect in the library results in memory leaks if a script repeatedly parses multiple documents. It is a known issue and the documentation includes information on how to mitigate the memory leak. The library is not in active development, so the memory leak and any other issues that pop up are unlikely to be fixed.

Another major issue that surfaced in my latest project is that the library is very inefficient. Some of the documents I need to parse are several megabytes in size. This is just textual data and HTML markup, not CSS, JavaScript, images or other web bloat. When simple_html_dom processes a 1 MB file, it can take up 800 MB of RAM and needs a lot of CPU cycles to do anything with it. I wanted to process files with less overhead and a quicker runtime. PHP includes built-in libraries for parsing HTML and XML documents, which turn out to be much more efficient than simple_html_dom. Using the built-in libraries can feel more convoluted though. I'll go through the hoops I had to jump through.

Not all web pages have correct HTML markup. In the case of the project I've been working on, data on the web page is provided in tables (<table>). The table rows have the correct closing tags (</tr>) but, for some reason, are missing the opening tags (<tr>) on every row other than the first one in the table. This and other issues with the HTML markup mean some cleanup needs to be done to the page source. PHP provides HTML Tidy through its tidy extension (which may need to be enabled in your installation). Once I download the HTML source for the page I'm attempting to parse, I run it through Tidy:

// $page_src holds the downloaded HTML source as a string
$tidy = new Tidy();
$tidy->parseString($page_src);
// repair the parsed markup
$tidy->cleanRepair();
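The Tidy step above can also be wrapped in a small function. This is a minimal sketch and not the exact code from my project: the function name tidy_page, the fallback when the tidy extension is missing, and the two configuration options are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: the name and the option choices are assumptions.
function tidy_page(string $page_src): string
{
    if (!extension_loaded('tidy')) {
        // Without the tidy extension, hand the markup back unchanged.
        return $page_src;
    }

    $config = [
        'wrap'        => 0,    // do not re-wrap long lines
        'output-html' => true, // emit HTML rather than XHTML or XML
    ];

    $tidy = new tidy();
    $tidy->parseString($page_src, $config, 'utf8');
    $tidy->cleanRepair();

    // Return the repaired markup as a plain string.
    return tidy_get_output($tidy);
}

// Usage: the second row is missing its opening <tr>, as on the pages I parse.
$broken = '<table><tr><td>alpha</td></tr><td>beta</td></tr></table>';
$clean  = tidy_page($broken);
```

Returning a string keeps the later DOM step independent of the tidy object.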

The next step is to turn the (tidied) HTML page source into a DOM document. PHP's DOM extension takes XML or HTML markup and builds a tree structure out of it. The extension includes various classes to navigate and access elements in the tree. One of these classes is DOMXPath. To take the tidied HTML and build a DOM document traversable with XPath, the following steps are needed:

$html_dom = new DOMDocument();
// loadHTML() expects a string, so cast the tidy object to its repaired markup
$html_dom->loadHTML((string) $tidy);
$xpath = new DOMXPath($html_dom);
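If the markup still has problems after tidying, loadHTML() emits a PHP warning for each one. One way to keep the output quiet is to let libxml collect the errors instead. The sketch below assumes a helper named load_html_quietly, which is my own invention, and uses a deliberately broken table as input.

```php
<?php
// Hypothetical helper: the name and the returned triple are assumptions.
function load_html_quietly(string $html): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom    = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    $errors = libxml_get_errors(); // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    if ($loaded === false) {
        throw new RuntimeException('The markup could not be parsed.');
    }

    return [$dom, new DOMXPath($dom), $errors];
}

// Usage: the second row has no opening <tr>.
[$html_dom, $xpath, $errors] = load_html_quietly(
    '<html><body><table><tr><td>a</td></tr><td>b</td></tr></table></body></html>'
);
```

The $errors array can be logged or ignored, depending on how much you trust the source pages.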

XPath is a query language for navigating a DOM tree and finding specific elements. Let's walk through how to access the elements of a table:

$nodeList = $xpath->query("body/table[1]")->item(0);

The query "body/table[1]" looks for the first <table> that is a direct child of the <body> of the HTML document. The path is relative, and by default PHP evaluates relative queries against the root element (<html>). (NOTE: Indexes in an XPath query start at 1.) Since there is only one first table, the query returns a list of nodes with a length of one. To retrieve the node containing the table, "item(0)" is used. (We're back to PHP indexes, which start from 0.) Once the statement has executed, $nodeList holds the node for the first <table> in the document. Despite the variable's name, it is a single node and not a list.
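Here is a small self-contained sketch of the same idea. The sample markup and variable names are made up for illustration. It uses an absolute path so it does not depend on the default context node, and it checks for the case where nothing matched, since item(0) returns null then.

```php
<?php
// Sample markup (an assumption for this example): two tables in the body.
$html = '<html><body>'
      . '<table><tr><td>first</td></tr></table>'
      . '<table><tr><td>second</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// An absolute path spells out the route from the document root.
$tables = $xpath->query('/html/body/table[1]');

// query() returns a DOMNodeList; item(0) is null when nothing matched.
$table = $tables->item(0);
if ($table === null) {
    throw new RuntimeException('No table found in the document.');
}

// XPath positions are 1-based, so table[2] selects the second table.
$second = $xpath->query('/html/body/table[2]')->item(0);
```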

To iterate through the rows of the table we'll use:

foreach ($nodeList->childNodes as $tr_num => $tr)

In this document, the immediate children of the <table> are the <tr> (table row) nodes. (If the markup contains <thead> or <tbody> elements, the rows sit one level deeper.) Iterating through the childNodes of the <table> gives us each of the <tr> nodes (and their descendants). We can now iterate through the columns of the table row:

foreach ($tr->childNodes as $td)

There is an important issue to consider when iterating through a table row. The immediate children of a <tr> may not be only <td> elements. There may also be nodes named "#text". These are text nodes, typically the whitespace between tags. (Other node types, such as comments, may show up too depending on the HTML source, and this is not limited to table rows. It is something to pay careful attention to with other HTML pages.)
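To see what is actually inside a row, it helps to dump the node names. The sketch below uses made-up markup with line breaks between the tags. Which whitespace nodes survive parsing depends on the libxml version, so I don't show the exact list of names here. Filtering on nodeType is one way to keep only the elements.

```php
<?php
// Sample markup (an assumption): line breaks between the tags of one row.
$html = "<html><body><table>\n<tr>\n<td>a</td>\n<td>b</td>\n</tr>\n</table></body></html>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr')->item(0);

// Collect the name of every child node, whatever its type.
$names = [];
foreach ($tr->childNodes as $child) {
    $names[] = $child->nodeName;
}

// Keep element nodes only; text and comment nodes are skipped.
$cells = [];
foreach ($tr->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $cells[] = $child->nodeName;
    }
}
```

Printing $names with print_r() shows where the "#text" entries sit in your own documents.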

I'm not concerned about the #text nodes, just the <td> nodes and their contents. So I check that the name of the node ($td) is "td":

if (strtolower($td->nodeName) !== "td") continue;

Now that I know the $td variable contains a <td> node, I can get to the contents of the column:

$contents = $td->nodeValue;
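Putting the pieces together, the whole table can be collected into an array of rows. This is a sketch of an alternative to the childNodes loops above, not the code from my project: it runs relative XPath queries against each node, so the "#text" check is not needed. The helper name table_to_rows and the sample markup are assumptions.

```php
<?php
// Hypothetical helper: the name and the array-of-rows return shape are assumptions.
function table_to_rows(DOMXPath $xpath, DOMNode $table): array
{
    $rows = [];
    // ".//tr" is relative to $table and also matches rows nested inside <tbody>.
    foreach ($xpath->query('.//tr', $table) as $tr) {
        $row = [];
        // "td" matches only <td> element children, so #text nodes are never returned.
        foreach ($xpath->query('td', $tr) as $td) {
            $row[] = trim($td->nodeValue);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Usage with a small, well-formed sample table.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><table>'
    . '<tr><td>Name</td><td>Count</td></tr>'
    . '<tr><td>Alpha</td><td> 3 </td></tr>'
    . '</table></body></html>'
);
$xpath = new DOMXPath($dom);
$table = $xpath->query('/html/body/table[1]')->item(0);
$rows  = table_to_rows($xpath, $table);
// $rows is [['Name', 'Count'], ['Alpha', '3']]
```

Note that ".//tr" also picks up rows of any table nested inside the one you pass in, so a stricter path may be needed for pages with nested tables.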

You can extrapolate from there how to access other information within an HTML document. Again, using PHP's built-in libraries can seem more convoluted and more complicated than simple_html_dom. However, they more than make up for that in efficiency. The built-in libraries need less than half the amount of RAM and are significantly faster than simple_html_dom.
