4 Replies - 2047 Views - Last Post: 20 January 2013 - 06:21 PM

#1 BJseal91

How To Remove HTML Code To View Just Text

Posted 20 January 2013 - 03:04 PM

Team,
I am researching how to strip the HTML markup from a web page so that only the text is left. Is this doable? I can already display the content of a page, but it comes with a load of HTML; I just want the text, not the code of the web page. I hope this makes sense.

Here is the code I use to display the page content:

Imports System.Net
Imports System.IO

Public Class form1
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        ' TextBox1 holds the URL; TextBox2 shows the downloaded markup.
        TextBox2.Text = getHTML(TextBox1.Text)
    End Sub

    ' Downloads the page at the given address and returns its raw HTML.
    Private Function getHTML(ByVal address As String) As String
        Dim WRequest As WebRequest = WebRequest.Create(address)
        ' Using blocks close the response and reader even if an exception is thrown.
        Using WResponse As WebResponse = WRequest.GetResponse()
            Using SR As New StreamReader(WResponse.GetResponseStream())
                Return SR.ReadToEnd()
            End Using
        End Using
    End Function
End Class


Kind Regards


Replies To: How To Remove HTML Code To View Just Text

#2 andrewsw

Re: How To Remove HTML Code To View Just Text

Posted 20 January 2013 - 03:43 PM

There may be a package to do this, otherwise here are two alternatives:

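Imports mshtml    ' add a reference to this (Microsoft.mshtml)

    Function textFromHtml(ByVal htmlToParse As String) As String
        ' write and body are members of IHTMLDocument2, not IHTMLDocument
        Dim htmlDocument As IHTMLDocument2 = CType(New HTMLDocument(), IHTMLDocument2)
        Dim sCollect As String = ""

        htmlDocument.write(htmlToParse)
        htmlDocument.close()

        Dim allElements As IHTMLElementCollection = CType(htmlDocument.body.all, IHTMLElementCollection)

        ' Note: a parent's innerText already includes the text of its descendants,
        ' so text inside nested elements is collected more than once.
        For Each elem As IHTMLElement In allElements
            sCollect += elem.innerText
        Next

        Return sCollect
    End Function

    ' Simple alternative: remove anything that looks like a tag.
    Public Function stripTags(ByVal htmlToParse As String) As String
        Return System.Text.RegularExpressions.Regex.Replace(htmlToParse, "<[^>]*>", "")
    End Function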


The loop could use a StringBuilder instead of repeated string concatenation.
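The stripTags regex above leaves the contents of script and style blocks in place and does not touch entities such as &amp;. A rough sketch of a slightly more thorough regex-only clean-up follows. StripTagsAndDecode and the module name are made up for illustration (they are not from the code above), and it assumes .NET 4.0 or later for WebUtility.HtmlDecode. Regexes are still not a real HTML parser, so treat this as an approximation.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlTextSketch
    ' Hypothetical helper (illustration only): regex-based clean-up of an HTML string.
    Public Function StripTagsAndDecode(ByVal html As String) As String
        ' Remove <script> and <style> blocks together with their contents.
        Dim result As String = Regex.Replace(html, "<(script|style)\b[^>]*>.*?</\1\s*>", "",
                                             RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        ' Replace the remaining tags with a space so neighbouring words do not run together.
        result = Regex.Replace(result, "<[^>]*>", " ")
        ' Convert entities such as &amp; and &nbsp; into the characters they stand for.
        result = WebUtility.HtmlDecode(result)
        ' Collapse runs of whitespace into single spaces and trim the ends.
        Return Regex.Replace(result, "\s+", " ").Trim()
    End Function
End Module
```

For example, StripTagsAndDecode("<p>Fish &amp; chips</p><script>var x = 1;</script>") should return "Fish & chips".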

This post has been edited by andrewsw: 20 January 2013 - 03:49 PM


#3 andrewsw

Posted 20 January 2013 - 04:59 PM

        ' Variation on the loop in textFromHtml: only collect text from selected tags
        Dim allElements As IHTMLElementCollection = htmldocument.body.all
        Dim sTags() As String = {"P", "DIV", "SPAN", "H1", "H2", "H3"}
        For Each elem As IHTMLElement In allElements
            Dim sTagUpper As String = elem.tagName.ToUpper()
            If sTags.Contains(sTagUpper) Then
                sCollect += elem.innerText
                ' Block-level elements end with a line break; SPAN is inline
                If sTagUpper <> "SPAN" Then
                    sCollect += Constants.vbCrLf
                End If
            End If
        Next


#4 andrewsw

Posted 20 January 2013 - 05:05 PM

I think it should be possible to load the page content directly as an HTML document, rather than downloading it as text, converting that to an HTML document, and then converting it back to text(?).

#5 andrewsw

Posted 20 January 2013 - 06:21 PM

Sorry, on a mission now..

        Dim sCollect As String = ""

        ' WebBrowser wraps an ActiveX control, so this has to run on an STA thread
        Dim browser As New System.Windows.Forms.WebBrowser()
        'AddHandler browser.DocumentCompleted, AddressOf DocLoaded

        browser.Navigate("http://allenbrowne.com")
        Do While browser.ReadyState <> System.Windows.Forms.WebBrowserReadyState.Complete
            ' need pause/sleep here
            System.Windows.Forms.Application.DoEvents()
            Console.WriteLine(browser.ReadyState.ToString())
        Loop
        Console.WriteLine("No longer busy..")
        Dim elems As System.Windows.Forms.HtmlElementCollection = browser.Document.Body.All
        Dim sTags() As String = {"P", "DIV", "SPAN", "H1", "H2", "H3"}

        For Each elem As System.Windows.Forms.HtmlElement In elems
            Dim sTagUpper As String = elem.TagName.ToUpper()
            If sTags.Contains(sTagUpper) Then
                sCollect += elem.InnerText
                If sTagUpper <> "SPAN" Then
                    sCollect += Constants.vbCrLf
                End If
            End If
        Next
        ' Dispose the browser only after its elements have been read
        browser.Dispose()
        Console.WriteLine(sCollect)

[I'm running this from a Console application, which is why the System.Windows.Forms names are fully qualified]

Quite interesting, this .NET stuff.

Because elem is (within the loop) an HTML element we can examine things like its id, style and attribute info.
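As a rough illustration of that last point, the sketch below prints a few properties of each element. DescribeElements and the module name are made up for this example; it assumes a collection such as browser.Document.Body.All from the code above, and it has to be called before browser.Dispose().

```vbnet
Imports System
Imports System.Windows.Forms

Module ElementInfoSketch
    ' Hypothetical helper (illustration only): prints tag, id, inline style
    ' and, for links, the href attribute of every element in the collection.
    Public Sub DescribeElements(ByVal elems As HtmlElementCollection)
        For Each elem As HtmlElement In elems
            Console.WriteLine("Tag:   " & elem.TagName)
            Console.WriteLine("Id:    " & elem.Id)
            Console.WriteLine("Style: " & elem.Style)
            If elem.TagName.ToUpper() = "A" Then
                Console.WriteLine("Href:  " & elem.GetAttribute("href"))
            End If
        Next
    End Sub
End Module
```

The same properties could be used to filter, for example by skipping elements with a particular id instead of matching on the tag name alone.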

This post has been edited by andrewsw: 20 January 2013 - 06:35 PM

