Page 1 of 1

Parsing HTML in x64 assembly - Part I

#1 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 544
  • View blog
  • Posts: 1,412
  • Joined: 22-August 09

Posted 30 June 2017 - 11:12 AM

Introduction

It has been quite a while since I have submitted a tutorial here on DreamInCode, and as I am back from whatever I have been doing for the last 6 years, I thought it would be interesting to write a parser for HTML5 in x64 assembler.

Now this might seem a little strange at first as the browsers such as Firefox, Google Chrome, Microsoft Edge etc do a pretty good job. However, it is not my intention to write a browser (or is it?) but to purvey some quite interesting design and coding techniques during this process. Let's see how it goes!

Design Boxes

If we look at HTML5 source, we have a set of tags and end tags with text between. That would lead us to believe that a simple XML parser would do the trick and that would be relatively straight forward to program. A recursive routine would do the trick as follows in pseudo-code.

node structure
    next_node as node
    previous_node as node
    first_child_node as node
    last_child_node as node
end node structure

return_tag = parse_html with pointer to first_node and pointer to last_node
    1. read a new_tag
    2. if the new_tag is an end_of_file tag go to step 7
    3. if this is an end tag go to step 7
    4. create a new node with the name of the new_tag and chain it into the tree using the first_node and last_node on the call
    5. call parse_html with the new_tag, first_child_node and last_child_node of the node just created
    6. if the return tag from parse_html matches the new_tag go to step 1
    7. return current_tag
end parse_html



OK. Let's dry run the pseudo code with <HTML><HEAD><TITLE>Dry Run Test</TITLE></HEAD><BODY></BODY></HTML>.

(0) parse_html with a pointer to first node and pointer to last node
(1) Parse_html step 1 : read a new tag ... <HTML> is now our new tag
(1) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(1) Parse_html step 3 : is this an end of tag? ... no it's not so onto step 4
(1) Parse_html step 4 : create new node with data ... done!
(1) Parse_html step 5 : call self with pointer to new node first_child_node and pointer to new node last_child_node
(2) Parse_html step 1 : read a new tag ... <HEAD> is now our new tag
(2) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(2) Parse_html step 3 : is this an end of tag? ... no it's not so onto step 4
(2) Parse_html step 4 : create new node with data ... done!
(2) Parse_html step 5 : call self with pointer to new node first_child_node and pointer to new node last_child_node
(3) Parse_html step 1 : read a new tag ... <TITLE> is now our new tag
(3) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(3) Parse_html step 3 : is this an end of tag? ... no it's not so onto step 4
(3) Parse_html step 4 : create new node with data ... done!
(3) Parse_html step 5 : call self with pointer to new node first_child_node and pointer to new node last_child_node
(4) Parse_html step 1 : read a new tag ... </TITLE> is now our new tag
(4) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(4) Parse_html step 3 : is this an end of tag? ... yes it is not so onto step 7
(4) Parse_html step 7 : return </TITLE>
(3) Parse_html step 1 : read a new tag ... </HEAD> is now our new tag
(3) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(3) Parse_html step 3 : is this an end of tag? ... yes it is not so onto step 7
(3) Parse_html step 7 : return </HEAD>
(2) Parse_html step 6 : does new tag <HEAD> and returned tag </HEAD> match? ... yes they do so onto step 1
(2) Parse_html step 1 : read a new tag ... <BODY> is now our new tag
(2) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(2) Parse_html step 3 : is this an end of tag? ... no it's not so onto step 4
(2) Parse_html step 4 : create new node with data ... done!
(2) Parse_html step 5 : call self with pointer to new node first_child_node and pointer to new node last_child_node
(3) Parse_html step 1 : read a new tag ... </BODY> is now our new tag
(3) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(3) Parse_html step 3 : is this an end of tag? ... yes it is not so onto step 7
(3) Parse_html step 7 : return </BODY>
(2) Parse_html step 1 : read a new tag ... </HTML> is now our new tag
(2) Parse_html step 2 : is the new tag an end of file tag? ... no it's not so onto step 3
(2) Parse_html step 3 : is this an end of tag? ... yes it is not so onto step 7
(2) Parse_html step 7 : return </HTML>
(1) Parse_html step 6 : does new tag <HTML> and returned tag </HTML> match? ... yes they do so onto step 1
(1) Parse_html step 1 : read a new tag ... end of file is now our new tag
(1) Parse_html step 2 : is the new tag an end of file tag? ... yes it is so onto step 7
(1) Parse_html step 7 : return end of file tag
(0) Completed parse

I am fairly happy with that pseudo-code, but notice how I skipped over the title text that sit's between the <TITLE> and </TITLE> tags. I am going to worry about that later on (says he brushing it under the carpet) as we just need to prove the overall design concept works with the flow of data.

Self-Terminating Tags

The HTML5 specification states that we have a number of self-terminating tags such as <BR> and <META>. This can be easily handled by adding an extra pseudo-code line between step 4 and step 5 such that we do not execute a recursive call for these self_terminating tags. All we need to do is simply go back to read the next tag.

node structure
    next_node as node
    previous_node as node
    first_child_node as node
    last_child_node as node
 end node structure

return_tag = parse_html with pointer to first_node and pointer to last_node
    1. read a new_tag
    2. if the new_tag is an end_of_file tag go to step 8
    3. if this is an end tag go to step 8
    4. create a new node with the name of the new_tag and chain it into the tree using the first_node and last_node on the call
    5. if new tag is self_terminating then go to 1
    6. call parse_html with the new_tag, first_child_node and last_child_node of the node just created
    7. if the return tag from parse_html matches the new_tag go to step 1
    8. return current_tag
end parse_html



Part II of this tutorial will complete the pseudo code and will be arriving here very soon! Keep an eye out for it!!!

Is This A Good Question/Topic? 1
  • +

Page 1 of 1