11 Replies - 1231 Views - Last Post: 27 February 2011 - 08:23 AM

#1 patticus73

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 9
  • Joined: 22-February 11

Adding any suffix to a base URL

Posted 26 February 2011 - 05:26 PM

I'm trying to build a basic data scraper (I'm a noob!), but I'm having trouble writing code that will scrape the website I'm looking at:

Let's say I'm trying to iterate through these three URLs:

www.examplescraper.com/fghxbvn/17901234.html
www.examplescraper.com/fghxbvn/17911102.html
www.examplescraper.com/fghxbvn/17921823.html

Each link is exactly the same except for a year (1790, 1791, 1792) followed by a random number at the end, so I want to set up a Beautiful Soup scraper like so:

base = "www.examplescraper.com/fghxbvn/"
year = 1790
rand = "????.html"

scrape= base + year + rand

year+=1

QUESTION: How would I signify "any four numbers + html" for the above "rand" variable?

Any and all help is greatly appreciated!

Patrick

Replies To: Adding any suffix to a base URL

#2 poncho4all

  • D.I.C Head!
  • member icon

Reputation: 123
  • View blog
  • Posts: 1,405
  • Joined: 15-July 09

Re: Adding any suffix to a base URL

Posted 26 February 2011 - 05:33 PM

Would you elaborate on the 2nd question?

For the year, you would have to do something like:
scrape = base + str(year) + rand


I think that's one way to do it; otherwise, read about it under string formatting operations.

Is this what you needed?
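
As a minimal sketch of both suggestions (converting with str() and using string formatting), assuming the four-digit suffix for a year is already known. The `build_url` helper is an illustrative name, and the suffix "1234" is taken from the first example URL in the question:

```python
base = "www.examplescraper.com/fghxbvn/"


def build_url(year, suffix):
    """Join the base, a numeric year, and a known suffix into one URL string."""
    # str() is needed because Python will not concatenate a str and an int.
    by_concat = base + str(year) + suffix + ".html"
    # The same result using string formatting instead of concatenation.
    by_format = "{0}{1}{2}.html".format(base, year, suffix)
    assert by_concat == by_format
    return by_format


print(build_url(1790, "1234"))
```

This only covers joining the pieces; it does not solve the problem of finding the suffix for each year.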

This post has been edited by poncho4all: 26 February 2011 - 05:47 PM

#3 patticus73

Posted 26 February 2011 - 05:52 PM

I want to iterate from 1790 to 2011 (years, as in the example), but after the year there are extra digits that do not follow a pattern, followed by ".html".

If you look through the three URLs provided as an example, you can see that there's the base, then the year (1790, 1791, 1792), then four random digits and ".html". I want to iterate through the years, but somehow tell Python to look for any four digits followed by ".html".


SO,

it would look like this: base + year (which is iterating in a loop) + "any four random digits" + ".html"

Is there a way to do this?

To establish my root question:

url = base + str(year) + any four numbers + ".html"

How do I signify "any four numbers"?

#4 poncho4all

Posted 26 February 2011 - 06:00 PM

Well, you could generate a random number for the four digits and add it to the string the same way you add the year in the example I gave you above.

If you have limits for the random number, you could do something like:

import random

base = "www.examplescraper.com/fghxbvn/"
year = 1790

# Lower bound 1000 and upper bound 9999, assuming (from the examples)
# that every suffix is at least one thousand.
lowerbound, upperbound = 1000, 9999
rand = random.randint(lowerbound, upperbound)  # both ends are inclusive

last = ".html"

scrape = base + str(year) + str(rand) + last



I think that could work.

#5 patticus73

Posted 26 February 2011 - 06:07 PM

But the "random digits" aren't exactly random, meaning in the three urls I used for the example, the last four digits after 1790 are always "1234" and the last four for 1791 are always "1102."

The only problem is that I want to iterate through a couple hundred URLs; otherwise I would just write down the last four digits, or the whole URLs for that matter.

Is there any way not to GENERATE four random digits, but rather to RECOGNIZE four random digits?

#6 poncho4all

Posted 26 February 2011 - 06:12 PM

Recognize from where?

#7 patticus73

Posted 26 February 2011 - 06:24 PM

Recognized from right before the .html and right after the year

SO

url = base + str(year) + RECOGNIZE any four digits + ".html"

#8 poncho4all

Posted 26 February 2011 - 06:35 PM

17901234
17911102
17921823

Is there no

17901235?

And what about the next number,

1793????

Is there a pattern to follow, so that you can select the number rather than guess it?

#9 patticus73

Posted 26 February 2011 - 06:45 PM

The list goes

1790 + four random numbers
1791 + four random numbers
1792 + four random numbers
1793 + four random numbers
1794 + four random numbers
1795 + four random numbers
1796 + four random numbers
1797 + four random numbers
1798 + four random numbers
1799 + four random numbers
1800 + four random numbers
1801 + four random numbers
.
.
. continues all the way until 2011

All the "random numbers" do NOT follow a pattern. :(

#10 poncho4all

Posted 26 February 2011 - 06:52 PM

I can't think of anything, man. All I've got right now is to iterate from the lowest possible number to the highest possible number, checking whether each page exists. But this would take too long :(
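
To put a rough number on "too long", the sketch below only counts the candidate URLs; it makes no requests. It assumes the worst case of trying every suffix from 0000 to 9999 for each year, and the rate of one request per second is an assumption chosen for illustration, not a measured figure:

```python
first_year, last_year = 1790, 2011
suffixes_per_year = 10 ** 4  # 0000 through 9999

years = last_year - first_year + 1
worst_case_requests = years * suffixes_per_year

# Assumed rate, for illustration only.
requests_per_second = 1.0
worst_case_days = worst_case_requests / requests_per_second / 86400

print(years, worst_case_requests, round(worst_case_days, 1))
```

Under those assumptions the worst case is about 2.2 million requests, which is why guessing is not a practical approach here.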

#11 patticus73

Posted 26 February 2011 - 07:13 PM

Thanks for trying.

Anyone else have an idea? Maybe using regular expressions? I have no idea. :(
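
One way the regular-expression idea could work, sketched under a big assumption: that some page on the site (an index or archive page) already links to every yearly page. A URL has to be concrete before it can be fetched, so a pattern like "any four digits" can only be matched against links that have already been collected, not requested directly. The `sample_html` string and the `LinkCollector` and `links_by_year` names are made up for illustration; in a real scraper the HTML would come from the downloaded index page.

```python
import re
from html.parser import HTMLParser

# Year (four digits), then any four digits, then ".html" at the end of the link.
LINK_PATTERN = re.compile(r"/fghxbvn/(\d{4})(\d{4})\.html$")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_by_year(html):
    """Map each year to the link whose file name starts with that year."""
    collector = LinkCollector()
    collector.feed(html)
    found = {}
    for link in collector.links:
        match = LINK_PATTERN.search(link)
        if match:
            found[int(match.group(1))] = link
    return found


# Hypothetical index page content, standing in for a downloaded page.
sample_html = """
<a href="www.examplescraper.com/fghxbvn/17901234.html">1790</a>
<a href="www.examplescraper.com/fghxbvn/17911102.html">1791</a>
<a href="www.examplescraper.com/about.html">About</a>
"""

for year, link in sorted(links_by_year(sample_html).items()):
    print(year, link)
```

The standard-library HTMLParser is used here only to keep the sketch self-contained; Beautiful Soup could collect the same href values. If no such index page exists, the pattern has nothing to match against and this approach does not apply.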

#12 atraub

  • Pythoneer
  • member icon

Reputation: 759
  • View blog
  • Posts: 2,010
  • Joined: 23-December 08

Posted 27 February 2011 - 08:23 AM

I don't fully understand what you're looking for... you just want to append a 4 digit random number to the end of a string? That's simple enough.

for year in range(1790, 2012):
    for num in range(10000):
        my_number = str(year) + "%04d" % num
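
The "%04d" format in the loop above pads each number with leading zeros, so every suffix is exactly four characters wide. A small sketch of what the loop produces if the strings are yielded lazily instead of being overwritten on each pass; the `candidate_names` generator is an illustrative name, not part of the original post:

```python
from itertools import islice


def candidate_names(first_year=1790, last_year=2011):
    """Yield every year + four-digit suffix combination as an 8-character string."""
    for year in range(first_year, last_year + 1):
        for num in range(10000):
            # "%04d" zero-pads, so 7 becomes "0007" rather than "7".
            yield str(year) + "%04d" % num


# Look at only the first three candidates instead of building the whole list.
print(list(islice(candidate_names(), 3)))
```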






Why are you doing this? To be honest, it looks like you're trying to use Python to locate private pictures on Photobucket :/

This post has been edited by atraub: 27 February 2011 - 08:34 AM
