3 Replies - 193 Views - Last Post: 24 April 2014 - 01:30 PM

#1 ssmitty

Need help with getting code to go from one web page to the next

Posted 24 April 2014 - 01:21 PM

I am a complete newbie with Java. I have a final assignment for a master's program in which the code has to scrape a web page for a little bit of data, including the link to the next page, follow that link, and repeat this 100 times. It scrapes all the data correctly, including the link, but it doesn't follow the link to scrape the data on the next page. Instead, it displays the first page's data 100 times. The code is in a while loop that reads the page one character at a time into a string and then uses pattern matching to pull the data and the next link out of that string. It then displays the data correctly and should loop back, connect to the next link, read that page one character at a time into the string, and so on.

I have printed the link to the console and it is good. I don't get any errors. I just can't figure out where exactly the problem is. I have searched Google for topics related to this issue and haven't come across any new ideas on following links. As you can see, I have already written the code. I just need new eyes to help me find the problem and point me in the right direction. I hope it is something simple. Here is my code:

import java.awt.*;
import java.io.*;
import java.net.*;
import java.util.regex.*;
import javax.swing.*;

public class SimpleWebSourceGetter {

    static void getSourceCode(String url) {
        String mystring = "";
        String myranking = "1";
        String mywriter = "";
        String myproducer = "";
        String myreleasedate = "";
        String mytext = "";

        int colstart = 0;
        int colend = 0;
        int col1 = 0;
        int col2 = 0;
        int col3 = 0;
        int col4 = 0;
        int col5 = 0;
        int col6 = 0;
        int num = 0;

        Pattern pattern1;
        Pattern pattern2;
        Pattern pattern3;
        Matcher matcher1;
        Matcher matcher2;
        Matcher matcher3;
        String s = "";
        String s1 = "";
        Boolean found = false;

        while (num < 100) {
            try {
                // creating the URL
                URL pageURL = new URL(url);

                // Create the http url connection object
                HttpURLConnection urlConnection = (HttpURLConnection) pageURL.openConnection();

                // Reading the stream
                InputStream in = new BufferedInputStream(urlConnection.getInputStream());

                Reader r = new InputStreamReader(in);

                int c;
                while ((c = r.read()) != -1) {
                    mystring = mystring + String.valueOf((char) c);
                } // end reading loop

            } // end try block
            catch (MalformedURLException ex) {
                System.out.println(url + " is not a valid URL. Please enter a URL starting with http://");
            } // end catch for improper URL
            catch (IOException ie) {
                System.out.println("Error while reading: " + ie.getMessage());
            } // end catch for io reasons

            // scrape the required data
            try {

                pattern1 = Pattern.compile("og:description");
                matcher1 = pattern1.matcher(mystring);
                if (matcher1.find()) {
                    col1 = mystring.indexOf("Writer:");
                    s = mystring.substring(col1);
                    col2 = s.indexOf("Producer:");
                    mywriter = s.substring(0, col2);
                    col3 = s.indexOf("Released:");
                    myproducer = s.substring(col2, col3);
                    col4 = s.indexOf(",");
                    myreleasedate = s.substring(col3, col4);
                    col5 = s.indexOf(".\" />");
                    col6 = col5 - 72;
                    mytext = s.substring(col6, col6 + 30);

                } // end if

                // find the next url
                pattern2 = Pattern.compile("<div class=\"listPagination\">");
                pattern3 = Pattern.compile("class=\"listPaginationControls");
                matcher2 = pattern2.matcher(mystring);
                matcher3 = pattern3.matcher(mystring);

                if (matcher2.find() && matcher3.find()) {
                    try {
                        colstart = mystring.indexOf("<div class=\"listPagination\">");
                        s1 = mystring.substring(colstart);
                        s1 = s1.replaceAll("\\s+", " ");
                        colend = s1.indexOf("class=\"listPaginationControls");

                        // creating the URL

                        url = "http://www.rollingstone.com" + s1.substring(38, colend - 2);
                    } // end try
                    catch (Exception e1) {
                    } // end catch
                } // end if

                // get the ranking
                pattern1 = Pattern.compile("listPaginationText\">");
                matcher1 = pattern3.matcher(s1);
                pattern2 = Pattern.compile("</span>");
                matcher2 = pattern2.matcher(s1);
                if (matcher1.find() && matcher2.find()) {
                    try {
                        colstart = s1.indexOf("listPaginationText\">");
                        colend = s1.indexOf("</span>");
                        myranking = s1.substring(colstart + 20, colend);

                        // print the data
                        System.out.print("Ranking: " + myranking + "  " + mywriter + " "
                                + myproducer + " " + myreleasedate + " Text:  " + mytext + "\n");

                    } // end try
                    catch (Exception e1) {
                    } // end catch
                } // end if
            } // end try
            catch (Exception e1) {
            } // end catch

            num++;

        } // end while
    } // end getSourceCode method

    public static void main(String[] args) {

        String url = "http://www.rollingstone.com/music/lists/the-500-greatest-songs-of-all-time-20110407/bob-dylan-like-a-rolling-stone-20110516";
        getSourceCode(url);

    } // end main method

} // end class





Replies To: Need help with getting code to go from one web page to the next

#2 CasiOo

Posted 24 April 2014 - 01:26 PM

You shouldn't ignore the exceptions that are thrown.
They might actually be able to tell you what's going wrong :)
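
To illustrate the point about exceptions, here is a minimal sketch, not the poster's code. It also touches on something visible in the posted listing: `mystring` is declared before the `while` loop and only ever appended to, so from the second iteration on it still begins with the first page, and every `indexOf` call appears to hit the first page's markers again. The class name `PageReaderSketch`, the helper `readAll`, and the two sample page strings are hypothetical; in-memory strings stand in for downloaded pages so the example makes no network requests.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the original post.
class PageReaderSketch {

    // Reads everything from the reader into a NEW buffer on every call,
    // so text from an earlier page cannot leak into the next one.
    static String readAll(Reader reader) {
        StringBuilder page = new StringBuilder();
        char[] chunk = new char[4096];
        try {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                page.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // Pass the failure on with its cause instead of swallowing it.
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for two downloaded pages (assumed content).
        String[] pages = { "<p>page one</p>", "<p>page two</p>" };
        for (String html : pages) {
            try {
                String current = readAll(new StringReader(html));
                System.out.println(current);
            } catch (UncheckedIOException e) {
                // Print the full stack trace so the cause is visible.
                e.printStackTrace();
            }
        }
    }
}
```

An empty `catch (Exception e1) { }` block, by contrast, would hide a failed `substring` or `indexOf` call entirely, which makes this kind of problem much harder to spot.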

#3 g00se

Posted 24 April 2014 - 01:30 PM

You should look at the code of a (good) open source Java web crawler or spider. You will see that recursion is necessary (or, better, a solution based on your own stack) and that using regex to parse links is not optimal when proper HTML parsers are already available.

What you've done is probably a good shot and will have been instructive, but it is even more valuable to learn how things are done properly.
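
The own-stack idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `StackCrawlSketch`, the method `crawl`, and the tiny in-memory link map are all hypothetical, and the map stands in for "fetch a page and extract its links", which in real code would be done with an HTML parser (jsoup is one such library, named here as an assumption and not something the thread mentions). No network access is involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch class, not taken from any real crawler.
class StackCrawlSketch {

    // Visits pages depth-first using an explicit stack instead of recursion.
    // 'links' maps a page to the pages it links to (a stand-in for parsing).
    static List<String> crawl(String start, Map<String, List<String>> links, int limit) {
        List<String> visitedOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty() && visitedOrder.size() < limit) {
            String page = stack.pop();
            if (!seen.add(page)) {
                continue; // already visited, skip it
            }
            visitedOrder.add(page);
            for (String next : links.getOrDefault(page, List.<String>of())) {
                if (!seen.contains(next)) {
                    stack.push(next);
                }
            }
        }
        return visitedOrder;
    }

    public static void main(String[] args) {
        // Assumed link structure: /a -> /b -> /c, with /b also linking back to /a.
        Map<String, List<String>> links = Map.of(
                "/a", List.of("/b"),
                "/b", List.of("/c", "/a"),
                "/c", List.<String>of());
        System.out.println(crawl("/a", links, 100)); // prints [/a, /b, /c]
    }
}
```

The `seen` set is what keeps the loop from revisiting a page when links point backwards, and the `limit` argument plays the role of the fixed page count in the original question.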

This post has been edited by g00se: 24 April 2014 - 01:31 PM
Reason for edit: Clarification


#4 modi123_1

Posted 24 April 2014 - 01:30 PM

You also should probably not ignore the website's terms of use when you decide to build a scraper.

Quote

You further agree that you will not use any automated devices, such as spiders, robots or data mining techniques to catalog, download, store or otherwise reproduce, store or distribute Content or to manipulate the RS Applications or Services.

http://www.rollingst.../services/terms


With that being said, I will close the topic and ask that you do not persist in asking further questions about your TOS-violating scraper here.

As always - if you have questions on the 'why' feel free to shoot me a PM.
