Page 1 of 1

Parsing HTML in x64 assembly - Part II

#1 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 544
  • View blog
  • Posts: 1,413
  • Joined: 22-August 09

Posted 05 July 2017 - 10:00 AM

Before we start delving into part II of this awesome topic, there are just a few little corrections to what we have covered so far.
Firstly, we need the node structure to contain the html tag identifier, the attributes list and also the text following an html tag (the '<title>Title Text</title>' rears its ugly head even though I had so gracefully hidden it under the carpet! So the complete structure looks like:-

node structure
    next_node as node
    previous_node as node
    first_child_node as node
    last_child_node as node
    html_tag as word integer
    attributes as string
    text_following_tag as string
end node structure



We also need to make a small adjustment to our pseudo-code such that the node creation occurs before step 1 (read a new tag). The chaining of the newly created node remains as is. This gives us:-

return_tag = parse_html with pointer to first_node and pointer to last_node
    0. create a new node
    1. read a new_tag
    2. if the new_tag is an end_of_file tag go to step 9
    3. if this is an end tag go to step 9
    4. set the name of the new_tag, it's attributes and text following into the node
    5. chain the node into the tree using the first_node and last_node on the call
    6. if new tag is self_terminating then go to 0
    7. call parse_html with the new_tag, first_child_node and last_child_node of the node just created
    8. if the return tag from parse_html matches the new_tag go to step 0
    9. return current_tag
end parse_html



Now to consider steps 0 and 1 a little more deeply. As you may of noticed in my previous tutorials I have a natural hatred of allocations on a heap (i.e. C++ 'operator new', or C 'malloc' type routines), particularly when I am writing in assembly. After all you don't expect a surgeon performing an operation using a chisel do you? No, they would use a laser scalpel. Same here, assembly is a very precise programming tool that can be crafted to perform magnificent tasks in the right hands!!!

Beautiful Pointers

Unlike high level languages, assembly programming does not perform any type checking on data. That is up to you as the programmer. So taking a simple set of code lines:-

               .data
Text           db         '12345678'
               .code
               lea        rsi, Text            ; Load Text address into rsi
               mov        al, byte ptr [rsi]   ; load '1' into al
               mov        ax, word ptr [rsi]   ; load '21' into ax
               mov        eax, dword ptr [rsi] ; load '4321' into eax
               mov        rax, qword ptr [rsi] ; load '87654321' into rax



So, the code above states that we have a register (rsi) and we load the address of a memory location (Text). The next line states that we want to load the first 8 bits of whatever is contained at that memory location into the al register.
The second line states that we want to load the first 16 bits of whatever is contained at that same memory location into the ax register. The third line states that we want to load the first 32 bits of whatever is contained at that same memory location into the eax register. And finally the last line states that we want to load the first 64 bits of whatever is contained at that same memory location into the rax register.

Hopefully, I haven't lost you yet. Now take a look at my comments next to the instructions. The first mov instruction loads the first 8 bits into the al register. That makes sense I hope (in hexadecimal ascii character '1' is 031H). The next mov instruction loads the first 16 bits into the ax register. This still loads '1' (031H) into bits 7-0 of the ax register then loads '2' (032H) into bits 15-8 of the ax register. This may seem a little confusing at first especially if you are unaware of endianness as the intel processor works on the little endian principle not the big endian (please refer to Wikipedia topic entitles 'endianness' found here). I will not bother explaining the last two mov statements - needless to say though, they load '4321' into eax (hexadecimal 034333231H) and '87654321' into rax (hexadecimal 03837363534333231H).

Work-space Memory

Now we have digested that, if we allocate a large enough chunk of memory that can contain all our nodes and text strings, we can use a pointer to get and put data into that chunk of memory providing we have an address and what type of data we are retrieving or storing. Now the question is how much memory should we allocate (using VirtualAlloc). Well we could be very scientific and parse our input data twice - once to compute the number of nodes and the size of data strings, or we could compute an approximation. For example, the smallest tag we have is a single character tag such as <a> and it's end tag </a> - a total of 7 characters. Now if we say that if we have a 14000 byte source, then we could not have no more than 2000 nodes (<a></a> repeated 2000 times gives 14000 bytes). For the text strings, if we say we have 14000 'a' characters between our <a></a> tags we would have 14000 'a' characters and 14000 null characters (string terminators) which gives us 28000 bytes. Our node structure is 50 bytes long giving us a total of (2000x50)+28000 = 128000 bytes of data. We could refine that computation even more but I personally don't think it is worth it!

Using registers

It might seem a little to early to start assigning pointers to registers, but it will help us in writing the pseudo-code for the processing of the tags. We will assume that the initial call to the parse_tag routine will have the r15 register containing the current address of the html source and the r14 register will contain the address of the first free byte in our work-space.

The pseudo-code for steps 0 and 1

0. Move r15 to rsi register (the current html source position)
1. Move r14 to rdi register (the address of the new node)
2. Add sizeof node to r15 (will now point to the free space
3. Load the next four bytes from the source into eax register
4. If the character in the al register is a '<' character then go to step 7
5. Increment rsi
6. Go to step 3
7. If the characters in the eax register are NOT equal to the string '--!>' (little endian remember) then go to step 16
8. Add 4 to rsi
9. Load the next four bytes from the source into eax register
10. Clear bits 31-24 of the eax register
11. If the characters in the eax register are equal to the string '<--' then go to step 14
12. Add 1 to rsi
13. Go to step 9
14. Add 3 to rsi
15. Go to step 3
16. If the characters in the eax register are NOT equal to 'OD!>' then go to step 34
17. Add 4 to rsi
18. Load the next eight bytes from the source into rax register
19. Clear bits 63-48 of the rax register
20. If the characters in the rax register are NOT equal to ' EPYTC' then go to step 28
21. Add 6 to rsi
22. Load the next byte into the al register
23. Is the al register equal to a space (020H) character ... if no go to step 26
24. Add 1 to rsi
25. Go to step 22
26. Load the next four bytes from the source into eax register
27. If the characters in the eax register are equal to the string 'LMTH' then go to step 29
28. Eeeks - this is not an HTML5 document --- DIE
29. Add 4 to rsi
30. Load the next byte into the al register
31. Is al register equal to a '>' ... if yes then go to step 3
32. Add 1 to rsi
33. Go to step 30
34. If the characters in the ax register are NOT equal to '/>' then go to step 42
35. Add 2 to rsi
36. Lookup html tag (positions rsi on first non-alphanumeric character)
37. Load the next byte into the al register
38. Is al register equal to a '>' ... if yes then go to step 41
39. Add 1 to rsi
40. Go to step 37
41. Return html tag
42. Add 1 to rsi
43. Lookup html tag and set in node (pointed to by rdi)
44. Load the next byte into the al register
45. Is the al register equal to a space (020H) character ... if no go to step 48
46. Add 1 to rsi
47. Go to step 44
48. Is al register equal to a '>' ... if yes then go to step 75
49. Put the rdi pointer into the node as the attributes pointer
50. Move the al register to the rdi address
51. Add 1 to rdi
52. Add 1 to rsi
53. Load the next byte into the al register
54. Is al register equal to a '>' ... if yes then go to step 75
55. Is the character in al a '"' (022H) character ... if yes then go to step 58
56. Is the character in al a ''' (027H) character ... if yes then go to step 64
57. Go to step 48
58. Move the al register to the rdi address
59. Add 1 to rdi
60. Add 1 to rsi
61. Load the next byte into the al register
62. Is the character in al a '"' (022H) character ... if yes then go to step 70
63. Go to step 58
64. Move the al register to the rdi address
65. Add 1 to rdi
66. Add 1 to rsi
67. Load the next byte into the al register
68. Is the character in al a ''' (027H) character ... if yes then go to step 70
69. Go to step 64
70. Move the al register to the rdi address
71. Add 1 to rdi
72. Add 1 to rsi
73. Load the next byte into the al register
74. Go to step 48
75. Add 1 to rdi
76. Add 1 to rsi
77. Put the rdi pointer into the node as the text_following_tag pointer
78. Load the next byte into the al register
79. Is the character in al a '<' (03CH) character ... if yes then go to step 84
80. Move the al register to the rdi address
81. Add 1 to rdi
82. Add 1 to rsi
83. Go to step 78
84. Chain the node into the tree using the first_node and last_node on the call
85. If new tag is self_terminating then go to 0
86. Call parse_html with the new_tag, first_child_node and last_child_node of the node just created
87. If the return tag from parse_html matches the new_tag go to step 0
88. Something has gone wrong!!!!

Proving the Code

As we have now got many steps to check, it is going to be easier to write the assembly code and test it. Before we can do that however, we need to consider steps 36 and 43 which in our pseudo-code is a mechanism to lookup the html tag characters. This will be covered in Part III. Exciting isn't it!!!

This post has been edited by Martyn.Rae: 05 July 2017 - 11:17 AM


Is This A Good Question/Topic? 0
  • +

Page 1 of 1