1 Reply - 579 Views - Last Post: 16 February 2010 - 08:12 AM

#1 s_kucksdorf

HTML to plain text.

Posted 14 February 2010 - 07:54 PM

Hello all. Last summer I created an online documentation website for the company I work for. One of its features is search: the user clicks search and types in a term, the page posts back to the server, the server performs the search, and it returns all the documents that contain the searched text. It works beautifully.

However, the search currently reads plain-text copies of the documents. What I would like to do instead is strip the markup from the HTML documents and search those directly. Right now, when the user clicks search, there is a full postback and the server looks through the plain-text views of the HTML documents. Then I strip the extension from each match and replace it with .html (or .htm). This can of course cause errors (404s) if the HTML document doesn't exist.

Is there any way to do this? I am required to use ASP.NET with VB (sorry to the CS fans). Thanks in advance for any help! Happy coding!
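
For illustration only, here is a rough sketch of the extension swap described above with a guard against the 404 case. It is written in Python for brevity rather than the VB.NET this question requires, and the names `html_counterpart` and `HTML_EXTENSIONS` are made up for the example. It assumes each plain-text file sits next to its HTML version and shares its base name.

```python
from pathlib import Path
from typing import Optional

# Extensions to try, in order, when mapping a text file back to its HTML page.
HTML_EXTENSIONS = (".html", ".htm")


def html_counterpart(text_path: Path) -> Optional[Path]:
    """Return the existing .html/.htm sibling of text_path, or None.

    Returning None lets the caller drop the hit from the results
    instead of linking to a page that would answer with a 404.
    """
    for ext in HTML_EXTENSIONS:
        candidate = text_path.with_suffix(ext)
        if candidate.is_file():
            return candidate
    return None
```

A search routine built this way would only list a match when `html_counterpart` returns a path.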

Replies To: HTML to plain text.

#2 woodjom

Re: HTML to plain text.

Posted 16 February 2010 - 08:12 AM

You might want to load the HTML DTD (or an equivalent XML schema) into an XSD-style framework and parse the documents against the known HTML elements; the parser can then extract the character data (CDATA) between the tags. I would have to do more research on this design, but that would be the basic starting point for stripping HTML.

This is basically how Google and the other search engines cache websites; obviously theirs is more sophisticated and does a lot more, but the basics are there.
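
As a rough sketch of the "parse the document and keep only the text between the tags" idea, something along these lines could work. Note the assumptions: it is in Python rather than the VB.NET the question asks for, it swaps the DTD/XSD validation suggested above for the standard library's lenient `html.parser` (real-world HTML is often not valid XML), and the names `TextExtractor`, `html_to_text`, and `document_matches` are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags, ignoring script and style bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space so adjacent blocks do not run
        # together; the split/join then collapses repeated whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def document_matches(html: str, term: str) -> bool:
    """Case-insensitive substring search over the stripped text."""
    return term.lower() in html_to_text(html).lower()
```

Searching the stripped text of the HTML file itself means every hit already points at a page that exists, so the extension swap is no longer needed. One trade-off in this sketch: because chunks are joined with a space, a word split by an inline tag (for example `wor<b>ld</b>`) would not match as a single word.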