14 Replies - 642 Views - Last Post: 11 February 2010 - 07:57 AM

#1 Guest_r3bb*


Reputation:

File names from URLs to a text file

Posted 09 February 2010 - 07:15 AM

Hi.
I need to write the names of all the files (in my case, they are all JPG images) in a specific directory of a website to a text file, one name per line.
I don't know how to do this; HttpWebRequest and HttpWebResponse don't seem to be what I need (I have never actually used them, but they were the only things I could find any information about).
Could you give me some suggestions on how to do it?
To be more specific, the urls are all like
http://images.comicbookresources.com/solicits/marvelcomics/201001/42_GHOST_RIDERS__HEAVENS_ON_FIRE_6.jpg

(yeah, these are comic books covers).
Since I always know which folder I'm interested in, the only thing that changes is the name of the file itself.
I've been advised to use the Bing API, but note that I'm only interested in the files in that specific folder, not in all the similar material that a Bing search could bring up.
Maybe I'm wrong, but the best I can think of would be request.Query="marvel solicitations site:comicbookresources.com", which would return an awful lot of unwanted results. Also, I don't quite understand the differences between the JSON, XML and SOAP implementations, but if the Bing API isn't the solution that in your opinion suits me best, then I'm not interested :D
Thanks in advance.


Replies To: File names from URLs to a text file

#2 Ferencn

  • D.I.C Regular

Reputation: 71
  • Posts: 322
  • Joined: 01-February 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 07:56 AM

Sounds like you're trying to leech a bunch of files from a website.

Please show us some code and we may be able to help you fix it. We will not write the application for you.

#3 JackOfAllTrades

  • Saucy!

Reputation: 6031
  • Posts: 23,413
  • Joined: 23-August 08

Re: File names from URLs to a text file

Posted 09 February 2010 - 08:22 AM

WTH does Bing have to do with anything??? That's weird.

#4 Sergio Tapia

  • D.I.C Lover

Reputation: 1252
  • Posts: 4,168
  • Joined: 27-January 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 08:30 AM

You can download a website's HTML source code using the WebClient.DownloadString() method.

#5 Guest_Guest*


Reputation:

Re: File names from URLs to a text file

Posted 09 February 2010 - 08:45 AM

Well, this is all I have at the moment (it's a slow moment at work and we don't have Visual Studio here, so I can't be more precise; it's hard to write code in plain Notepad... this evening, when I'm home, my ideas will take better shape).

using System;
using System.IO;
using System.Net;
using System.Windows.Forms;

public class FileClass
{
	// I have 2 comboboxes, one for choosing the month and the other for the year
	// (cmbYear and cmbMonth are placeholder names for them)
	private ComboBox cmbYear;
	private ComboBox cmbMonth;

	private void btnGenerate_Click(object sender, EventArgs e)
	{
		// this will produce 201001 for January 2010 or 200911 for November 2009, for example, which is the way the site identifies the folders for months
		string cod = cmbYear.Text + cmbMonth.Text;

		string fileName = cod + ".txt";
		// CreateText returns an open StreamWriter; close it right away so AppendText can open the file later
		File.CreateText(fileName).Close();

		// of course what I'm missing is the way to put what I want into that imageName variable
		string imageName = null;
		AppendToFile(imageName, fileName, cod);
	}

	static void AppendToFile(string img, string file, string c)
	{
		// using closes the writer even if WriteLine throws
		using (StreamWriter sw = File.AppendText(file))
		{
			sw.WriteLine("[URL=http://images.comicbookresources.com/solicits/marvelcomics/" + c + "/" + img + "][IMG=http://images.comicbookresources.com/solicits/marvelcomics/" + c + "/sm/" + img + "][/URL]");
		}
	}
}
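
One detail the snippet above depends on is that the month part of the folder code is always two digits (201001, not 20101). A minimal sketch of one way to build that code from numeric values, assuming the combo boxes can be read as integers; the FolderCode class and Build method are hypothetical names used only for illustration:

```csharp
using System;

public static class FolderCode
{
    // Builds the yyyyMM folder name used by the site, e.g. (2010, 1) -> "201001".
    public static string Build(int year, int month)
    {
        if (month < 1 || month > 12)
            throw new ArgumentOutOfRangeException("month");
        // D4 and D2 pad with leading zeros, so January becomes "01" rather than "1".
        return string.Format("{0:D4}{1:D2}", year, month);
    }

    public static void Main()
    {
        Console.WriteLine(Build(2010, 1));   // 201001
        Console.WriteLine(Build(2009, 11));  // 200911
    }
}
```

If the combo boxes already hold zero-padded strings, plain concatenation as in the post works just as well.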


#6 Guest_r3bb*


Reputation:

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:14 AM

Sorry for the double post, I didn't notice those other two until I posted.

To JackOfAllTrades:
about Bing, that suggestion was probably made because the site doesn't allow direct folder browsing: trying to access the directory that contains the images, which in my previous example would be http://images.comicb...lcomics/201001/, returns an Apache Forbidden error.

To stapia.gutierrez:
I thought about getting the source code as well, but I figured that working with it would be a HUGE pain in the ass because I'd have to filter the image names out of it. Or maybe I'm wrong and it would be simpler than it looks?
I looked at the source code of the page where all the covers are shown. Every image name appears twice, once in an image tag such as

<img src="http://images.comicbookresources.com/solicits/marvelcomics/201001/sm/8_AMAZING_SPIDER_MAN_PRESENTS__JACKPOT_1.jpg"/>

and once in an anchor tag

<a href="/news/preview2.php?image=solicits/marvelcomics/201001/92_SPIDER_MAN__THE_CLONE_SAGA_5.jpg">

The difference between the two, apart from one being a direct link and the other not, is that the image tag has an extra /sm/ folder between the month's folder and the file name. So the images and their thumbnails have the same names and reside in different folders.
So now my question has become: is it possible to filter those file names out of the huge file that contains the whole source code (which I could easily create), taking them from either one of these two tags?
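
A minimal sketch of the string handling involved, assuming the attribute values have already been pulled out of the page: in both tags the file name is simply whatever follows the last slash, and the full-size URL is the thumbnail URL without the sm folder. The ImageNames class and its method names are hypothetical, and the approach only holds while the site keeps this layout.

```csharp
using System;

public static class ImageNames
{
    // Returns everything after the last '/', which is the file name in both
    // the thumbnail src and the preview href quoted above.
    public static string FileNameOf(string attributeValue)
    {
        int slash = attributeValue.LastIndexOf('/');
        return slash < 0 ? attributeValue : attributeValue.Substring(slash + 1);
    }

    // Turns a thumbnail URL into the full-size URL by dropping the "sm" folder.
    // Assumes "/sm/" appears only once in the URL.
    public static string FullSizeUrl(string thumbnailUrl)
    {
        return thumbnailUrl.Replace("/sm/", "/");
    }

    public static void Main()
    {
        string src = "http://images.comicbookresources.com/solicits/marvelcomics/201001/sm/8_AMAZING_SPIDER_MAN_PRESENTS__JACKPOT_1.jpg";
        string href = "/news/preview2.php?image=solicits/marvelcomics/201001/92_SPIDER_MAN__THE_CLONE_SAGA_5.jpg";

        Console.WriteLine(FileNameOf(src));
        Console.WriteLine(FileNameOf(href));
        Console.WriteLine(FullSizeUrl(src));
    }
}
```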

#7 Sergio Tapia

  • D.I.C Lover

Reputation: 1252
  • Posts: 4,168
  • Joined: 27-January 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:21 AM

Yes, it is! Why don't you join the site? You ask questions that show your own ideas instead of a simple GIEF ME CODEZ PLX like some people.

What you can do is look into HtmlAgilityPack. It's a library (.dll) that you can use to pull whatever you need out of the HTML source.

So you could fetch all of the IMG tags from the source and then use some sort of regex to keep only the ones that match the month the user selected.

Just keep in mind that the second the webmaster changes his date scheme your application is DEAD. :P

#8 JackOfAllTrades

  • Saucy!

Reputation: 6031
  • Posts: 23,413
  • Joined: 23-August 08

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:21 AM

Where directory browsing isn't allowed, no API is going to help. You will need to download the HTML source and scrape the data out (hence the term web scraping) for subsequent retrieval.

#9 Ferencn

  • D.I.C Regular

Reputation: 71
  • Posts: 322
  • Joined: 01-February 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:27 AM

r3bb, on 09 February 2010 - 08:14 AM, said:

So now my question has become: is it possible to filter those filenames out of the huge file that contains the whole source code (which I could easily create), taking them from either one of these two tags?

Yes, that is possible. Note that you could work out some rules that let you compose the correct path to an image from the two paths that appear on the thumbnail page.
Stapia gutierrez gives some pointers and correctly observes that as soon as the naming convention, or the way the thumbnail page is built, changes, your application will stop working.

#10 Sergio Tapia

  • D.I.C Lover

Reputation: 1252
  • Posts: 4,168
  • Joined: 27-January 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:34 AM

I'll give you some help with the HtmlAgilityPack part. With it you can find all the img tags in an HTML source file. Hope this helps. :)

Our target:
<img src="blabalbalbal.jpeg" />


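How to fetch the actual URLs of the images:

using System;
using System.Linq;
using HtmlAgilityPack;

// url is the address of the page that lists the covers
var document = new HtmlWeb().Load(url);
// take the src attribute of every img tag, skipping tags that have none
var urls = document.DocumentNode.Descendants("img")
                                .Select(e => e.GetAttributeValue("src", null))
                                .Where(s => !String.IsNullOrEmpty(s));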
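
Once the src values are in hand, narrowing them down to one month's thumbnails can be done with plain string operations. A rough sketch, assuming the URL layout described earlier in the thread; CoverFilter and NamesForMonth are hypothetical names, and the example.com entry is made-up sample data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoverFilter
{
    // Keeps only the thumbnail URLs of the given month folder (e.g. "201001")
    // and returns their bare file names, without duplicates.
    public static List<string> NamesForMonth(IEnumerable<string> urls, string cod)
    {
        string prefix = "http://images.comicbookresources.com/solicits/marvelcomics/" + cod + "/sm/";
        return urls
            .Where(u => u.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .Select(u => u.Substring(prefix.Length))
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        var urls = new[]
        {
            // thumbnail for January 2010: kept
            "http://images.comicbookresources.com/solicits/marvelcomics/201001/sm/8_AMAZING_SPIDER_MAN_PRESENTS__JACKPOT_1.jpg",
            // full-size image, no /sm/ folder: skipped
            "http://images.comicbookresources.com/solicits/marvelcomics/201001/42_GHOST_RIDERS__HEAVENS_ON_FIRE_6.jpg",
            // unrelated image (made-up URL): skipped
            "http://example.com/banner.png"
        };

        foreach (string name in NamesForMonth(urls, "201001"))
        {
            Console.WriteLine(name);
        }
    }
}
```

The same sequence of src values produced by the HtmlAgilityPack query could be passed straight in as the urls argument.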


#11 r3bb

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 09-February 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:41 AM

Quote

Stapia gutierrez gives some pointers and correctly observes that as soon as the naming convention, or the way the thumbnail page is built, changes, your application will stop working.

Yes, I already thought of that. I guess that when that happens I'll change my program :D

Thanks for the help. When I get home I'll start experimenting with that, and then I'll probably annoy you again :D

By the way, as you can see I joined :)

#12 Sergio Tapia

  • D.I.C Lover

Reputation: 1252
  • Posts: 4,168
  • Joined: 27-January 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:48 AM

Great, don't hesitate to ask for help.

#13 r3bb

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 09-February 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 09:58 AM

So the WebClient.DownloadString() method you mentioned before would give me a plain wall of text, while var document = new HtmlWeb().Load(url); produces something I can navigate, something like JavaScript's DOM. Am I right?

#14 Sergio Tapia

  • D.I.C Lover

Reputation: 1252
  • Posts: 4,168
  • Joined: 27-January 10

Re: File names from URLs to a text file

Posted 09 February 2010 - 10:01 AM

The DownloadString() method downloads exactly what you see when you press Ctrl+U in Firefox: the source HTML markup.

#15 r3bb

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 09-February 10

Re: File names from URLs to a text file

Posted 11 February 2010 - 07:57 AM

After thinking about it, I've decided to use DownloadString() and a regex to extract what I want.
I have already worked out the regex:
string exp = @"(?<=http://images.comicbookresources.com/solicits/marvelcomics/" + date + @"/sm/)\w+(?<!tpb|hc|hcvar)\.jpg";

This one does just what I need, I've already tested it (yes, it heavily relies on the way the website is structured, I know).
I've also written a method that puts the source into a string
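private string getPageSource(string url)
		{
			// declared outside the try block so it is still in scope at the return
			string source = null;
			try
			{
				// using disposes the WebClient even if DownloadString throws
				using (WebClient client = new WebClient())
				{
					source = client.DownloadString(url);
				}
			}
			catch (Exception)
			{
				MessageBox.Show("HTTP connection problems.");
			}
			// null means the download failed
			return source;
		}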

but I still have a couple of doubts.
The first is whether it's a good idea to keep such a huge string in memory, or whether it would be better to store it in a text file (the source is roughly 1400-1500 lines...).
The second is how I should actually apply the regex to that string/file.
This morning I used this very useful tool: I entered the regex (with an actual number such as 201001 in place of my date parameter), pasted the whole source into the Source textbox, and got all the results I needed. In what form will the results be in my program? My ultimate goal, as I said in the first post, is to create a text file that has every image name I filtered as a line. Will the result be a string already formatted this way? Will it be an array with each result as an element? Or have I misunderstood the page I linked before, and will it just be a string with the first match (because, as I read, regular expressions return the first match)?
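
For what it's worth, here is a minimal sketch of how both doubts could be handled. In .NET, Regex.Match returns only the first match, while Regex.Matches returns a MatchCollection holding every match, and a page of roughly 1500 lines is small enough to keep in a single string. The CoverExtractor class and ExtractNames method are hypothetical names, the sample HTML in Main is stand-in data (the Ghost Riders thumbnail URL is assumed from the statement that thumbnails share the full-size names), and the output file name follows the cod + ".txt" idea from the earlier post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class CoverExtractor
{
    // Applies the regex from the post above and returns every match, one per element.
    public static List<string> ExtractNames(string source, string date)
    {
        string exp = @"(?<=http://images.comicbookresources.com/solicits/marvelcomics/" + date + @"/sm/)\w+(?<!tpb|hc|hcvar)\.jpg";
        var names = new List<string>();
        // Regex.Matches walks the whole input; each Match.Value is one file name.
        foreach (Match m in Regex.Matches(source, exp))
        {
            names.Add(m.Value);
        }
        return names;
    }

    public static void Main()
    {
        // Stand-in for the string returned by getPageSource(url).
        string source =
            "<img src=\"http://images.comicbookresources.com/solicits/marvelcomics/201001/sm/8_AMAZING_SPIDER_MAN_PRESENTS__JACKPOT_1.jpg\"/>\n" +
            "<img src=\"http://images.comicbookresources.com/solicits/marvelcomics/201001/sm/42_GHOST_RIDERS__HEAVENS_ON_FIRE_6.jpg\"/>";

        List<string> names = ExtractNames(source, "201001");

        // WriteAllLines writes each array element as its own line.
        File.WriteAllLines("201001.txt", names.ToArray());
        Console.WriteLine(names.Count + " names written");
    }
}
```

Each name can be wrapped in the [URL]...[IMG] markup before writing if that is the format the text file should end up in.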
