5 Replies - 922 Views - Last Post: 03 May 2011 - 02:19 PM

#1 TomJoad

  • D.I.C Head

Reputation: 12
  • Posts: 54
  • Joined: 01-December 10

str.rfind() skipping actual rfind() while looping

Posted 01 May 2011 - 01:03 PM

Hello everyone:

Below, please find a section of code which is not working as intended. If there is interest, I can modify the code so it can be run locally. I know it is sometimes easier to spot the problem after seeing a piece of code execute.

			 for (int xxx = 0; xxx < 50; xxx++)
			 {
				 int iii = 0;
				 
				 cout << "\n\n\n*****the int iii == " << iii << endl;
				 
				 
				 size_t testContents = contents.rfind(firstandlast[0]);
				 if (int(testContents) == contents.npos)
				 {
					 cout << "rfind firstandlast[0] not found, breaking" << endl;
					 contents.clear();
					 break;
				 }
				  //used to test contents to see if it contains the first parsing hit, if not, break the loops
						
			 
				 for (iii; iii<firstandlast.size(); iii+=2)
				{

						 
					 string first = firstandlast[iii];
					 string last = firstandlast[iii+1];
					 
					 size_t foundfirst;
					 size_t foundlast;
					 
					 foundfirst = contents.rfind(first); //gets the position of the first
					 foundlast = contents.find(last, int(foundfirst)+1); //gets the position of the last
					 //finds the firstHit and lastHit positions
					 
					 int i_foundlast = int(foundlast); //changes foundlast to an int
					 int i_foundfirst = int(foundfirst) + first.size(); //changes to an int
					 //this is probably unnecessary
					 
					 contents.resize(i_foundlast);
					 //crops the contents at the position of the lastHit
					 
					 char buffer[256];
					 contents.copy(buffer, 256, i_foundfirst);
					 string sbuffer = string(buffer);
					 //copies the string at the end of firstHit into buffer to sbuffer
					 
					 for (int resetBuffer = 0; resetBuffer < 256; resetBuffer++)
					 {
						 buffer[resetBuffer] = '\0';
					 }
					 //resets buffer
					 
					size_t endpos = sbuffer.find_last_not_of(" \r\t\n");
					if( string::npos != endpos )
					{
					sbuffer = sbuffer.substr( 0, endpos+1 );
					}
					//trim trailing space,etc.
					
					size_t startpos = sbuffer.find_first_not_of(" \r\t\n");
					if( string::npos != startpos )
					{
					sbuffer = sbuffer.substr( startpos );
					}
					//trim leading space,etc.
					 
					cout << "\n===========CONTENTS===========\n\n= " << sbuffer <<" =="<<endl;
					//used for debugging in the console
					  
					cout << "\n============first and last=========\n\n=" << first<< " =\n= " << last << " ="<< endl; 
					//used for debugging in the console
					
					ofstream write;
					write.open ("results.txt", fstream::app);
					write << "\n>>>" << iii << "<<<\n" << sbuffer << endl << s_fileno << endl;
					write << "--->> First --->>" << first << endl;
					write << "\n--->> Last --->>" << last << endl;
					
					string tmpcontents;
					int endsize = contents.size() - 500;
					tmpcontents = contents.substr(endsize);
					
					write << "\n\n======================BEGIN URL CONTENTS=========================\n\n" << tmpcontents << "\n\n======================END URL CONTENTS=========================\n\n" << endl;
					//debugging through the txt file
					
					write.close();
				}
				

			}


Notes:
  • This part of the code is a method of a class. contents is a std::string member of the same class, which is why it is not declared here.
  • firstandlast is a vector that was already initialized and is working properly.
  • From line 63 down, the code is used only for debugging.
  • firstHit (in the comments) is the same as first; likewise, lastHit (in the comments) is the same as last.
  • The code compiles and runs without error.


The problem:
It loops through the second for loop ( for (iii; iii<firstandlast.size(); iii+=2) ) correctly the first time. In the process, you will notice that it is resizing contents so the loop will repeat without picking up the same information. However, when it goes to loop the second time, it is actually skipping what the next rfind() should be, and goes directly to the next one.

The solution appears to be to reset the pos of contents to the end, but shouldn't rfind() be doing that automatically? Perhaps I don't completely understand how resize() or rfind() works.
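On the rfind() question: a std::string keeps no search position between calls. With no position argument, rfind() starts from the end of the string as it is at that moment, so after resize() it simply searches the shorter string. A minimal sketch of that behavior follows; the helper names lastMarkerPos and cropBeforeLast are made up for illustration and are not part of the posted code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: position of the last occurrence of `marker`,
// or std::string::npos if it does not occur.
std::size_t lastMarkerPos(const std::string& text, const std::string& marker)
{
    // With no position argument, rfind() starts at the end of the string
    // as it is right now; no state is carried over from earlier calls.
    return text.rfind(marker);
}

// Hypothetical helper: crop `text` so that the last occurrence of `marker`
// and everything after it are removed. Returns false if it is not found.
bool cropBeforeLast(std::string& text, const std::string& marker)
{
    const std::size_t pos = text.rfind(marker);
    if (pos == std::string::npos)   // compare as size_t, no cast to int
        return false;
    text.resize(pos);               // later rfind() calls see only the shorter string
    return true;
}
```

Note that the posted loop crops at the position of last, which lies after first, so the matched first is still in the string afterwards. Whether that is what causes the skipping depends on the data and on where the other pairs crop; it is a guess, not a confirmed diagnosis.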

Debug txt file it generates:
Spoiler


You will notice that when it loops back through the second time (represented by the second >>>0<<<) it actually found text that is well above what it should have been (you can find what the rfind() should have found by looking at the URL contents directly above).

I feel bad for coming to this forum twice in a week for help with my problem, but, unfortunately, I don't know anyone irl who programs in C++, so this forum is kind of my only way to get outside help. :/

If any of this is unclear, please let me know.


Replies To: str.rfind() skipping actual rfind() while looping

#2 JackOfAllTrades

  • Saucy!

Reputation: 6066
  • Posts: 23,526
  • Joined: 23-August 08

Posted 01 May 2011 - 02:14 PM

So you're parsing HTML? Is there any chance the HTML is actually XHTML that could potentially be parsed by an XML library using XPath? Parsing HTML is ugly in any language, but it's particularly heinous in C/C++.

#3 TomJoad

Posted 01 May 2011 - 02:46 PM

Yes, I am parsing through html. However, I have tried to set it up so it would parse through anything. So, theoretically, that shouldn't be the problem. If you look through the debug text, some of what it is looking for aren't full html tags, but partial ones.

The basic idea:
I have a string (contents in this case) and I give the program two parameters (first and last). The program finds first through rfind() (line 28), so it works from bottom to top. After it locates first, it locates last using find() (line 29). It then crops at the location of last (line 36) and copies from the end of the location of first to the end of the string (line 40). Thus it is copying what is between first and last. It then trims whitespace and other non-printing characters (lines 50-62). It keeps doing this until the vector firstandlast is exhausted (iii<firstandlast.size()). This is what I consider completing the loop the first time.

The program then checks that it will find firstandlast[0] again (line 8), meaning there is more information left to be extracted.

It will then continue through these loops until firstandlast[0] (line 8) is no longer found.

What is happening is that it loops through the first time (paragraph 2, above), but it skips what it should actually be finding the second time around. Basically, the first extraction works, the second is skipped, the third works, the fourth is skipped, etc. Obviously, the program isn't skipping times 2, 4, 6, etc., on purpose, as computers only do what you tell them to. I have somehow told it to do this, but cannot figure out how I did it, or how to fix it.
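The extraction steps described above can be sketched with substr() instead of a fixed char buffer. This is a sketch under assumptions, not the original method: the helper names trim and extractLastBetween are made up, and it crops at the start of first rather than at last, which is a deliberate change from the posted loop.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strip leading and trailing whitespace.
std::string trim(const std::string& s)
{
    const char* ws = " \r\t\n";
    const std::size_t start = s.find_first_not_of(ws);
    if (start == std::string::npos)
        return "";                       // empty or all whitespace
    const std::size_t end = s.find_last_not_of(ws);
    return s.substr(start, end - start + 1);
}

// Hypothetical helper: find the last `first`, then the next `last` after it.
// On success, store the trimmed text between them in `out`, crop `contents`
// at the start of `first`, and return true. On failure, change nothing.
bool extractLastBetween(std::string& contents,
                        const std::string& first,
                        const std::string& last,
                        std::string& out)
{
    const std::size_t firstPos = contents.rfind(first);
    if (firstPos == std::string::npos)
        return false;

    const std::size_t bodyStart = firstPos + first.size();
    const std::size_t lastPos = contents.find(last, bodyStart);
    if (lastPos == std::string::npos)
        return false;

    // substr() copies exactly the requested range, so there is no
    // fixed-size buffer and no missing terminator to worry about.
    out = trim(contents.substr(bodyStart, lastPos - bodyStart));
    contents.resize(firstPos);           // the matched `first` is removed too
    return true;
}
```

Calling extractLastBetween() repeatedly until it returns false walks the string from bottom to top. As a side note, std::string::copy() does not write a terminating '\0', so string(buffer) in the posted code depends on whatever happened to be in buffer beforehand; substr() avoids that.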

This post has been edited by TomJoad: 01 May 2011 - 02:49 PM


#4 TomJoad

Posted 03 May 2011 - 10:16 AM

I believe my problem actually lies with the string contents itself... I did some more testing and manually set the string contents and ran it through the parser and it worked.

I've done some more playing around and added:

			 ofstream tmpurlcontents;
			 tmpurlcontents.open("tmpurl.txt");
			 tmpurlcontents << contents;
			 tmpurlcontents.close();
			 contents.clear();
			 
			 ifstream readurl;
			 readurl.open("tmpurl.txt");
			 while (getline(readurl, contents))
			 {
				 contents.append(contents);
			 }
			 readurl.close();


The idea is to write contents to a file and then read it back in. However, nothing usable comes back as a string. I even tried readurl >> contents instead of the while (getline...) loop to read it directly. Still nothing.
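One thing worth checking in the read-back loop: each successful getline() replaces whatever its target string held, so using contents as both the line buffer and the accumulator discards the earlier lines, and a final failed call at end of file can leave it empty. A minimal sketch of two ways to read a whole file, assuming made-up helper names readWholeFile and readWholeFileByLine:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: read an entire file into one string, bytes unchanged.
// Returns false if the file cannot be opened.
bool readWholeFile(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        return false;

    std::ostringstream buf;
    buf << in.rdbuf();      // copy everything, newlines included
    out = buf.str();
    return true;
}

// Line-by-line alternative: a separate `line` variable keeps the
// accumulator from being overwritten by getline().
bool readWholeFileByLine(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str());
    if (!in)
        return false;

    out.clear();
    std::string line;
    while (std::getline(in, line))
    {
        out += line;
        out += '\n';        // getline() strips the delimiter
    }
    return true;
}
```

The operator>> variant mentioned above would not help either, since it reads only a single whitespace-delimited token.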

I try to open the file and it tells me it is not UTF-8 or Western (ISO-8859-15). I can, however, open it with Windows 7 notepad.

In desperation, here is the complete process of grabbing the URL and converting it to a string:

#define MAX_FILE_LENGTH 20000

// Helper Class for reading result from remote host
	http::http()
	{
		this->m_pBuffer = NULL;
		this->m_pBuffer = (char*) malloc(MAX_FILE_LENGTH * sizeof(char));
		this->m_Size = 0;
	};

	http::~http()
	{
		if (this->m_pBuffer)
			free(this->m_pBuffer);
	};

	void* http::Realloc(void* ptr, size_t size)
	{
		if(ptr)
			return realloc(ptr, size);
		else
			return malloc(size);
	};

	// Callback must be declared static, otherwise it won't link...
size_t http::WriteMemoryCallback(char* ptr, size_t size, size_t nmemb)
	{
		// Calculate the real size of the incoming buffer
		size_t realsize = size * nmemb;

		// (Re)Allocate memory for the buffer
		m_pBuffer = (char*) Realloc(m_pBuffer, m_Size + realsize);

		// Test if Buffer is initialized correctly & copy memory
		if (m_pBuffer == NULL) {
			realsize = 0;
		}

		memcpy(&(m_pBuffer[m_Size]), ptr, realsize);
		m_Size += realsize;


		// return the real size of the buffer...
		return realsize;
	};


	void http::print() 
	{
		contents = std::string(m_pBuffer);
		//std::cout << "Size: " << m_Size << std::endl;
		//std::cout << "Content: " << std::endl << m_pBuffer << std::endl;
	}


How it starts:

								
http httpwork;
curlpp::Cleanup cleaner;
curlpp::Easy request;

//http mWriterChunk;

// Set the writer callback to enable cURL
// to write result in a memory area
curlpp::types::WriteFunctionFunctor functor(&httpwork,
	&http::WriteMemoryCallback);
curlpp::options::WriteFunction *test = new curlpp::options::WriteFunction(functor);
request.setOpt(test);

// Setting the URL to retrieve.
request.setOpt(new curlpp::options::Url(url));
request.setOpt(new curlpp::options::Verbose(true));
request.perform();

httpwork.print();


I'm using libcurlpp.

This is getting way more technical than I ever thought it would. Messages coming in through libcurl are that the format it receives is UTF-8.

Suddenly this has gone way over my head in terms of expertise. Is it writing the file as binary? I don't know what is going on anymore.
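A possible culprit in the code above is std::string(m_pBuffer): that constructor treats the buffer as a C string and reads until the first '\0', but the write callback copies raw bytes and never adds a terminator. The string can therefore stop early or run past the received data. A minimal sketch of building the string from pointer plus length instead; bufferToString is a made-up helper name, and whether this is the actual cause here is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: build a std::string from a raw buffer whose length
// is tracked separately (m_pBuffer and m_Size in the posted class).
std::string bufferToString(const char* buffer, std::size_t size)
{
    if (buffer == NULL || size == 0)
        return std::string();
    // The (pointer, length) constructor copies exactly `size` bytes. It does
    // not look for a terminating '\0', and it keeps any '\0' inside the data.
    return std::string(buffer, size);
}
```

In the posted print() method, the equivalent change would be contents = std::string(m_pBuffer, m_Size);.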

It tells me that the upload fails every time I try to upload the file (it's only 31kb).

I'm so stumped :(

#5 ishkabible

  • spelling expret

Reputation: 1622
  • Posts: 5,709
  • Joined: 03-August 09

Posted 03 May 2011 - 10:28 AM

I'm not sure what method of parsing you're trying to use, but I would recommend using a recursive descent parser. Recursive descent parsers are typically clean and easy to understand; in any case, using an established parsing method would be highly beneficial.

Something like this would parse an HTML block:
void parseBlock() {
   Expect("<");
   std::string temp = Identifier();
   Expect(">");
   BodyOfBlock();
   Expect("<");
   Expect("/");
   Expect(temp);
   Expect(">");
}



'Expect' would expect a specific token.
'Identifier' would expect an identifier such as 'h1' or 'h2'.
'BodyOfBlock' would expect either some text or another block, denoted by "<" occurring as the first symbol. 'BodyOfBlock' would also keep parsing until it reached an unqualified "<".
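The outline above can be fleshed out into a small runnable sketch. The class name BlockParser and its members are made up for illustration, and the grammar is deliberately tiny: no attributes, no self-closing tags, no comments, no entities. Real HTML needs much more than this.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Minimal recursive descent parser for nested <tag>text</tag> blocks.
// It only checks that tags are balanced; it builds no tree.
class BlockParser
{
public:
    explicit BlockParser(const std::string& input) : text(input), pos(0) {}

    // Parse one block: "<" Identifier ">" BodyOfBlock "</" Identifier ">".
    void parseBlock()
    {
        expect("<");
        const std::string name = identifier();
        expect(">");
        bodyOfBlock();
        expect("<");
        expect("/");
        expect(name);
        expect(">");
    }

    bool atEnd() const { return pos == text.size(); }

private:
    // Consume `token` at the current position or throw.
    void expect(const std::string& token)
    {
        if (text.compare(pos, token.size(), token) != 0)
            throw std::runtime_error("expected '" + token + "'");
        pos += token.size();
    }

    // Consume a tag name such as h1 or h2.
    std::string identifier()
    {
        const std::size_t start = pos;
        while (pos < text.size() &&
               std::isalnum(static_cast<unsigned char>(text[pos])))
            ++pos;
        if (pos == start)
            throw std::runtime_error("expected identifier");
        return text.substr(start, pos - start);
    }

    // Consume text and nested blocks until a closing tag ("</") is next.
    void bodyOfBlock()
    {
        while (pos < text.size())
        {
            if (text[pos] != '<')
                ++pos;                       // plain text
            else if (pos + 1 < text.size() && text[pos + 1] == '/')
                return;                      // closing tag: the caller handles it
            else
                parseBlock();                // nested block, recurse
        }
    }

    std::string text;
    std::size_t pos;
};
```

Constructing BlockParser with a string and calling parseBlock() either consumes one balanced block or throws std::runtime_error at the first mismatch, such as a closing tag whose name differs from the opening one.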

#6 JackOfAllTrades

Posted 03 May 2011 - 02:19 PM

The mixing of C/C++ is rather ugly. Why not use a stringstream in your http class? On a side note in that area:

void* http::Realloc(void* ptr, size_t size)
{
    if(ptr)
	return realloc(ptr, size);
    else
	return malloc(size);
};


You don't actually need to do that, because realloc, when passed a NULL pointer, behaves like malloc. However, when using realloc you really need to employ a temporary pointer: if it fails, it returns NULL and leaves the original block allocated, so assigning the result straight back to your only pointer loses the address of the already-allocated memory and you could end up with a massive memory leak.

void* http::Realloc(void* ptr, size_t size)
{
    void *temp = realloc(ptr, size);
    if (!temp)
    {
        cerr << "Memory allocation failure" << endl;
        free(ptr);
        ptr = NULL;
    }
    else
    {
        ptr = temp;
    }
    
    return ptr;
};
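As for the stringstream suggestion at the top of this post, a minimal sketch of what that could look like follows. StreamCollector is a made-up stand-in, not the posted http class, and it is not wired into curlpp here; the write() member only mirrors the shape of the posted WriteMemoryCallback.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical stand-in for the posted http class: it collects incoming
// chunks in a std::ostringstream, so there is no malloc/realloc to manage.
struct StreamCollector
{
    std::ostringstream buffer;

    // `ptr` points at size * nmemb bytes that are not null-terminated.
    std::size_t write(char* ptr, std::size_t size, std::size_t nmemb)
    {
        const std::size_t realsize = size * nmemb;
        buffer.write(ptr, static_cast<std::streamsize>(realsize));
        // libcurl treats a return value different from the byte count
        // it passed in as an error, so 0 is returned if the write failed.
        return buffer ? realsize : 0;
    }

    // Everything received so far, including any embedded '\0' bytes.
    std::string str() const { return buffer.str(); }
};
```

With this layout, the string is produced by str() with its full length, so no terminator handling is needed.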

