6 Replies - 429 Views - Last Post: 08 March 2013 - 08:39 PM

#1 coderall

Extract Every Link in HTML Page Problem

Posted 08 March 2013 - 07:52 PM

My program has the user enter a URL, then extracts all the links from that page and prints them. However, I can only get the first link to print. Can someone please explain why the other links aren't appearing as well? Basically, my program takes the HTML page source, grabs the first link it sees, and keeps looping over only that one. I created a loop to check every part of the code for links ending in .doc, .txt, or .pdf, but it doesn't move on to the next link in the page source and extract that (assuming all links end in doc, txt, or pdf).

    String parsedPage = Fetch.fetchURL(strURL);
    String startingLink = "href=\"";
    String endingLink = "\"";
    // Locate the first href=" and the closing quote that follows it
    int position = parsedPage.indexOf(startingLink);
    int startOfURL = position + startingLink.length();
    int endOfURL = parsedPage.indexOf(endingLink, startOfURL);
    String webLink = parsedPage.substring(startOfURL, endOfURL);

    for (int i = 0; i <= webpage.length(); i++) {
        if (webLink.endsWith(".doc")
                || webLink.endsWith(".txt")
                || webLink.endsWith(".pdf")) {
            System.out.println(webLink);
        }
    }





Replies To: Extract Every Link in HTML Page Problem

#2 pbl

Posted 08 March 2013 - 08:01 PM

What is webpage?
It has a length() method, so it could be a String,
so why <= that length?

And what is webLink? How can the tests on webLink.endsWith() produce different results within the loop? The loop changes i from 0 to the string's length, so how would that change webLink?

So why test webLink within that loop at all? It probably doesn't crash on <= only because you never use i :)

#3 coderall

Posted 08 March 2013 - 08:11 PM

pbl, on 08 March 2013 - 08:01 PM, said:

What is webpage?
It has a length() method, so it could be a String,
so why <= that length?

And what is webLink? How can the tests on webLink.endsWith() produce different results within the loop? The loop changes i from 0 to the string's length, so how would that change webLink?

So why test webLink within that loop at all? It probably doesn't crash on <= only because you never use i :)


Whoops, webpage should say parsedPage. parsedPage holds the whole HTML source of the web page.

webLink is meant to represent the different links in the HTML source of the web page. For this piece of my code, the user wants to see all document-type links on the page (in this case .doc, .txt, and .pdf). That's why the if statement only requires one of those endings to match.

Does that clear things up a bit?

#4 flareback

Posted 08 March 2013 - 08:11 PM

Hard to say without knowing what the webpage variable is in the loop counter.

Besides that, I don't see how webLink is ever going to be different. It gets set before the loop and only gets printed, never set to a new value.

Side note:
Is there a reason you don't look for the HTML <a> tag to check for links? There very well may be, but I'm just curious. I guess the <a> tag would only give you clickable links.
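
A rough sketch of the <a>-tag idea, for comparison with searching for href= alone. The class name AnchorHrefs and the method anchorHrefs are hypothetical, and the pattern is a simplification: it only handles double-quoted href values and is not a substitute for a real HTML parser. It does show the difference the side note hints at: href on a <link> tag is skipped, while href on an <a> tag is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorHrefs {
    // Matches an opening <a ...> tag and captures the double-quoted href value.
    // Assumption: attribute values never contain '>' and always use double quotes.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the href value of every <a> tag found in the given HTML text.
    static List<String> anchorHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Whether restricting the search to <a> tags is desirable depends on the assignment; searching for href= anywhere also picks up stylesheet and other non-clickable references.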

#5 pbl

Posted 08 March 2013 - 08:15 PM

coderall, on 08 March 2013 - 10:11 PM, said:

Does that clear things up a bit?

Not really... how is webLink updated in the loop, and thus what is the use of the loop?

#6 coderall

Posted 08 March 2013 - 08:20 PM

flareback, on 08 March 2013 - 08:11 PM, said:

Hard to say without knowing what the webpage variable is in the loop counter.

Besides that, I don't see how webLink is ever going to be different. It gets set before the loop and only gets printed, never set to a new value.

Side note:
Is there a reason you don't look for the HTML <a> tag to check for links? There very well may be, but I'm just curious. I guess the <a> tag would only give you clickable links.


Well, I look for href= to check for links.


pbl, on 08 March 2013 - 08:15 PM, said:

coderall, on 08 March 2013 - 10:11 PM, said:

Does that clear things up a bit?

Not really... how is webLink updated in the loop, and thus what is the use of the loop?


That's my problem. I'm trying to go through each link in the html page source code and check to see if it is a .doc, .txt or .pdf. If it's not, skip. If it is, print it and go to the next link. I figured a loop would help me go to the next link if I set it right.
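
The "is it a .doc, .txt or .pdf" test described here could be pulled into a small helper so the loop body stays short. This is only a sketch: the names DocumentLinks and isDocumentLink are hypothetical, and lower-casing the link before comparing is an added assumption (the original code compares case-sensitively).

```java
import java.util.Locale;

final class DocumentLinks {
    // The three endings named in the question.
    private static final String[] EXTENSIONS = {".doc", ".txt", ".pdf"};

    // Returns true if the link ends with one of the document extensions.
    // Assumption: matching ignores case, so "REPORT.PDF" also counts.
    static boolean isDocumentLink(String link) {
        String lower = link.toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

A link such as files/report.PDF would pass this check, while index.html or notes.docx would be skipped.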

#7 pbl

Posted 08 March 2013 - 08:39 PM

Your code seems overcomplicated to me,
but you probably want

String webLink = parsedPage.substring(startOfURL, endOfURL);

within the loop... and don't forget to update startOfURL and endOfURL within the loop.
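
One way that advice might look in practice, as a minimal sketch rather than the only fix: the search position is carried forward with the two-argument String.indexOf(str, fromIndex), so each pass finds the next href=" instead of the first one again. The class HrefScanner, the method extractHrefs, and the sample page string are hypothetical; the variable names follow the original post.

```java
import java.util.ArrayList;
import java.util.List;

final class HrefScanner {
    // Collects every double-quoted href value in the page, in order of appearance.
    static List<String> extractHrefs(String parsedPage) {
        String startingLink = "href=\"";
        String endingLink = "\"";
        List<String> links = new ArrayList<>();

        int position = parsedPage.indexOf(startingLink);
        while (position != -1) {
            int startOfURL = position + startingLink.length();
            int endOfURL = parsedPage.indexOf(endingLink, startOfURL);
            if (endOfURL == -1) {
                break; // no closing quote: stop rather than read past the end
            }
            links.add(parsedPage.substring(startOfURL, endOfURL));
            // Resume the search after the link that was just extracted.
            position = parsedPage.indexOf(startingLink, endOfURL + 1);
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page source standing in for Fetch.fetchURL(strURL).
        String page = "<a href=\"a.pdf\">A</a> <a href=\"index.html\">B</a> "
                + "<a href=\"notes.txt\">C</a>";
        for (String webLink : extractHrefs(page)) {
            if (webLink.endsWith(".doc")
                    || webLink.endsWith(".txt")
                    || webLink.endsWith(".pdf")) {
                System.out.println(webLink);
            }
        }
    }
}
```

The key change from the original is that webLink is recomputed on every pass and the loop ends when indexOf returns -1, not after a fixed number of iterations.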

P.S.
If I had to code that, I would start by extracting everything between < and > into an array of String, so maybe:

String[] token = parsedPage.split("<");
Not really interested in your problem, it's not really challenging, but this is what I would do.
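
A sketch of how the split("<") idea could be carried through, under the assumption that no attribute value contains a '<' or '>' character. The names TagSplitter and hrefsFromTags are hypothetical. Each token produced by the split starts just after a '<', so the text up to the first '>' is the tag body, and anything after it is ordinary page text that can be ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class TagSplitter {
    // Splits the page on '<', keeps the part of each token before '>',
    // and pulls the double-quoted href value out of tags that have one.
    static List<String> hrefsFromTags(String parsedPage) {
        String marker = "href=\"";
        List<String> links = new ArrayList<>();
        for (String token : parsedPage.split("<")) {
            int close = token.indexOf('>');
            if (close == -1) {
                continue; // text before the first tag, or an unterminated tag
            }
            String tag = token.substring(0, close);
            int start = tag.indexOf(marker);
            if (start == -1) {
                continue; // this tag has no href attribute
            }
            start += marker.length();
            int end = tag.indexOf('"', start);
            if (end != -1) {
                links.add(tag.substring(start, end));
            }
        }
        return links;
    }
}
```

Compared with scanning the whole page for href=, this version ignores any href=" text that appears outside a tag.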
