Searching a string that contains nulls

  • (3 Pages)
  • +
  • 1
  • 2
  • 3

37 Replies - 1643 Views - Last Post: 23 June 2019 - 09:02 AM Rate Topic: -----

#1 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2334
  • View blog
  • Posts: 7,113
  • Joined: 15-January 14

Searching a string that contains nulls

Posted 14 June 2019 - 03:37 PM

In C, I'm trying to search in a string using strstr, but I've found a case where this string might be littered with null characters that I need to look beyond. The string is read from a file using fread, and then I use strstr and strtol to look for substrings and convert specific parts to long. Are there library functions similar to strstr that will scan an entire character array instead of stopping at the first null character?

It seems like the only alternative might be to read the file and record the length of the data that was read, and then pass my character array and its total length around to the other functions that need to use it. I would need to write a function similar to strstr that would get each substring that is the same length as the one I'm looking for just by looping through the characters and compare it with my needle. Is that the right way to go about this?

Is This A Good Question/Topic? 0
  • +

Replies To: Searching a string that contains nulls

#2 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 6969
  • View blog
  • Posts: 23,685
  • Joined: 05-May 12

Re: Searching a string that contains nulls

Posted 14 June 2019 - 04:15 PM

Time to break out Knuth and implement Rabin-Karp or Knuth-Morris-Pratt string search algorithms.
Was This Post Helpful? 1
  • +
  • -

#3 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 6969
  • View blog
  • Posts: 23,685
  • Joined: 05-May 12

Re: Searching a string that contains nulls

Posted 14 June 2019 - 04:34 PM

If that is too much trouble then you could implement something like the following pseudo code:
// load data into a buffer 
fseek(pFile, 0, SEEK_END)
bufferSize = ftell(pFile)
rewind(pFile)
buffer = malloc(bufferSize + 1)    // allocate space for a null terminator
fread(buffer, 1, bufferSize, pFile)
buffer[bufferSize] = 0             // emplace a null terminator
bufferSize += 1


// search for string in buffer
int find(char * needle, char * buffer, int bufferSize)
{
    needleSize = strlen(needle)
    while (bufferSize >= needleSize)
    {
        found = strchr(needle, buffer)
        if (found)
            return 1;

        segmentLength = strlen(buffer)
        segmentLength = segmentLength + 1;    // count the null terminator
        buffer = buffer + segmentLength
        bufferSize = bufferSize - segmentLength
    }
    return 0; // not found
}



Please double check the pseudo code above for off-by-one errors that may overrun the buffer. The general idea is that if strchr() can't find the string within the current null terminated segment pointed to by buffer, make buffer point to the block of memory after the null terminator.

Please note that the pseudo code above is not very efficient because is runs through a memory segment once with a call to strchr(), and then runs over the same segment with a call to strlen(). You could try to memoize where each of the segments starts so that succeeding searches won't have to keep calling strlen(), but that there are a lot more savings to be found if you implement one of the two algorithms I mention earlier in post #2. As I recall from your other thread where you were using strchr() you were looking for maximum efficiency.
Was This Post Helpful? 1
  • +
  • -

#4 #define   User is offline

  • Duke of Err
  • member icon

Reputation: 1860
  • View blog
  • Posts: 6,698
  • Joined: 19-February 09

Re: Searching a string that contains nulls

Posted 14 June 2019 - 05:03 PM

You could change the null characters to spaces. Either when you have the array in memory, or by creating your own fread type of function.
Was This Post Helpful? 1
  • +
  • -

#5 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2334
  • View blog
  • Posts: 7,113
  • Joined: 15-January 14

Re: Searching a string that contains nulls

Posted 14 June 2019 - 05:13 PM

Quote

The general idea is that if strchr() can't find the string within the current null terminated segment pointed to by buffer, make buffer point to the block of memory after the null terminator.

That's brilliant. My first idea was to get the first character of the needle and seek until I find it. Assuming strchr is faster than a for loop, your idea sounds a lot more efficient.

I'll take a look at some of the other algorithms and see what I can come up with, thanks.

Quote

You could change the null characters to spaces. Either when you have the array in memory, or by creating your own fread type of function.

That's not a bad idea either.
Was This Post Helpful? 0
  • +
  • -

#6 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 6969
  • View blog
  • Posts: 23,685
  • Joined: 05-May 12

Re: Searching a string that contains nulls

Posted 14 June 2019 - 08:14 PM

Replacing nulls with spaces (or some other non-zero character) maybe the lowest friction approach.

Pseudo code:
// load data into a buffer 
fseek(pFile, 0, SEEK_END)
bufferSize = ftell(pFile)
rewind(pFile)
buffer = malloc(bufferSize + 1)    // allocate space for a null terminator
fread(buffer, 1, bufferSize, pFile)
for(i = 0; i < bufferSize; i++)
{
    if (buffer[i] == '\0')
        buffer[i] = 30;    // ASCII code for Record Separator
}
buffer[bufferSize] = 0             // emplace a null terminator


Then just use strstr() as you normally do.
Was This Post Helpful? 0
  • +
  • -

#7 jimblumberg   User is offline

  • member icon

Reputation: 5733
  • View blog
  • Posts: 17,568
  • Joined: 25-December 09

Re: Searching a string that contains nulls

Posted 15 June 2019 - 05:17 AM

Quote

In C, I'm trying to search in a string using strstr, but I've found a case where this string might be littered with null characters that I need to look beyond.

So then you need to treat this "string" as an array of characters not a string, which means that you shouldn't be using the standard string functions, until you somehow strip the "extra" nul characters from the array. Remember in C a string is an array of char terminated by the nul character, anything after the first nul character is considered non-valid.

And in Windows make sure you're using binary mode when reading that array.

Also if the "string" can contain a nul char then you probably should also be checking for other control characters as well. And remember a new line character could be more than a single character.

Jim
Was This Post Helpful? 0
  • +
  • -

#8 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2334
  • View blog
  • Posts: 7,113
  • Joined: 15-January 14

Re: Searching a string that contains nulls

Posted 17 June 2019 - 10:10 AM

I did end up just converting the null characters. It may search the string from one to three times each time the module runs, and also potentially run strtol up to two times, so I thought that it would be quickest to just loop over the characters once and do any conversions right after the file is read.

// convert null characters to spaces
  int i;
  for (i = 0; i < length; i++) {
    if (buffer[i] == '\0') {
      buffer[i] = ' ';
    }
  }
  // add null character to the end
  buffer[length] = 0;
  length++;


Quote

Also if the "string" can contain a nul char then you probably should also be checking for other control characters as well.

Only a null character would stop strstr from searching further though, correct? The purpose of this code is just to check if certain substrings exist, I'm not printing anything or otherwise using the data other than just searching.

I'll make a modification to check the position of the first null and compare that with the buffer length to decide whether I even need to loop.
Was This Post Helpful? 0
  • +
  • -

#9 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 6969
  • View blog
  • Posts: 23,685
  • Joined: 05-May 12

Re: Searching a string that contains nulls

Posted 17 June 2019 - 10:24 AM

View PostArtificialSoldier, on 17 June 2019 - 01:10 PM, said:

I'll make a modification to check the position of the first null and compare that with the buffer length to decide whether I even need to loop.

You are not gaining with this. If the null is truly at the end of the buffer, then you've scanned through the entire buffer (and incurred any of the corresponding page faults/cache load delays). If the null is somewhere before the end of the buffer, then on average, you would have scanned through at least half of the buffer to find the null, and then have to scan through that half of the buffer again, and then continue on to do the rest of the buffer.

Any which way, you should do some profiling to see what really pans out. It maybe that 80% of the time you'll find it at the end. Or maybe that 80% of the time, you'll always find the null within the first memory page of data and not incur any page faults or cache load delays.
Was This Post Helpful? 0
  • +
  • -

#10 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2334
  • View blog
  • Posts: 7,113
  • Joined: 15-January 14

Re: Searching a string that contains nulls

Posted 17 June 2019 - 11:07 AM

This whole thing is an edge case, we have a couple clients that have a SAML login setup on their application, and the SAML package we're using decided to use nulls in some of the session variable names, so they show up in the saved session file. This is like 1% of the use cases, if that.

Would using strlen really need to have it loop through the entire array, and would that additional check really be slower than just always looping myself? I'll try to find some more files that I can use to test with and start timing things.
Was This Post Helpful? 0
  • +
  • -

#11 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2334
  • View blog
  • Posts: 7,113
  • Joined: 15-January 14

Re: Searching a string that contains nulls

Posted 17 June 2019 - 11:24 AM

It looks like checking with strlen is faster instead of always looping. The good news is that it needs 10,000 iterations to test this. Here's the version with the check:

  if (strlen(buffer) < length) {
    //printf("Removing nulls\n");
    int i;
    for (i = 0; i < length; i++) {
      if (buffer[i] == '\0') {
        buffer[i] = ' ';
      }
    }

    buffer[length] = 0;
    length++;
  }


That's the first test. The second test is without the outer if statement:

[[email protected] c]# ./a.out
10000 iterations: read_file took 1.850000 seconds to read sess_test2
10000 iterations: read_file took 0.200000 seconds to read sess_test2b
10000 iterations: read_file took 1.220000 seconds to read sess_test2c
10000 iterations: read_file took 0.200000 seconds to read sess_test2d
10000 iterations: read_file took 0.200000 seconds to read sess_test2e
10000 iterations: read_file took 0.320000 seconds to read sess_test2f
10000 iterations: read_file took 0.210000 seconds to read sess_test2g
10000 iterations: read_file took 0.190000 seconds to read sess_test2h
[[email protected] c]# gcc test2.c
[[email protected] c]# ./a.out
10000 iterations: read_file took 1.840000 seconds to read sess_test2
10000 iterations: read_file took 0.220000 seconds to read sess_test2b
10000 iterations: read_file took 5.440000 seconds to read sess_test2c
10000 iterations: read_file took 0.300000 seconds to read sess_test2d
10000 iterations: read_file took 0.270000 seconds to read sess_test2e
10000 iterations: read_file took 0.710000 seconds to read sess_test2f
10000 iterations: read_file took 0.330000 seconds to read sess_test2g
10000 iterations: read_file took 0.240000 seconds to read sess_test2h



That 2c file is the largest session file I could find quickly. The first file is the only one that contains nulls. So that first file is actually very slightly faster without the if statement, but checking the string length is better in all other cases. Since the presence of nulls is an edge case, I'll optimize for everything else and check the strlen before looping.
Was This Post Helpful? 0
  • +
  • -

#12 #define   User is offline

  • Duke of Err
  • member icon

Reputation: 1860
  • View blog
  • Posts: 6,698
  • Joined: 19-February 09

Re: Searching a string that contains nulls

Posted 17 June 2019 - 05:20 PM

A substitute for strstr could have a function declaration such as

char * strstrnull ( const char * buffer, const char * buffer_start, 
                    const char * substr, size_t       buffer_length );




Parameters:

buffer : start of string to be scanned
buffer_start : start position in string for current scan
substr : sequence to match
buffer_length : length of string? buffer?

Return Value:

A pointer to the next occurrence in buffer_string, starting from buffer_start, of the entire sequence of characters specified in substr, or a null pointer if the sequence is not present in buffer.

This post has been edited by #define: 17 June 2019 - 05:21 PM
Reason for edit:: tidiness

Was This Post Helpful? 0
  • +
  • -

#13 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 6969
  • View blog
  • Posts: 23,685
  • Joined: 05-May 12

Re: Searching a string that contains nulls

Posted 17 June 2019 - 05:50 PM

What optimization settings do you ha e for your a.out? It would be like trying to compare apples and oranges if you are comparing the speed of strlen() which is fully optimized vs. your for loop in post #8.
Was This Post Helpful? 0
  • +
  • -

#14 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 6969
  • View blog
  • Posts: 23,685
  • Joined: 05-May 12

Re: Searching a string that contains nulls

Posted 17 June 2019 - 05:58 PM

On further though, it still going to be very hard to beat a strlen() implementation that has been optimized by a compiler that uses REPNE SCASB to scan for a zero byte vs. the for loop above which may not be optimized to scan to the next zero byte, make sure not at end of buffer, replace with a space, then continue scanning again.
Was This Post Helpful? 0
  • +
  • -

#15 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2334
  • View blog
  • Posts: 7,113
  • Joined: 15-January 14

Re: Searching a string that contains nulls

Posted 18 June 2019 - 11:51 AM

I'm not sure what the settings are, but I'm also not sure how to do a better test, or I suppose I could make a plan there but it's a whole lot more effort than just doing some basic timing. I do know that my compiler isn't used in the end product, though, and I'm hesitant to add much logging to any production version (the client that has the nulls in their session data is on one of our shared servers). I could add some logging to run on our test server but in order to test this I would either need to set up a test SAML environment or maybe just do some of my own PHP session exploiting and change the session to what I want it to be for a certain user.

The issue with an actual test and the compilers and whatnot is that I'm using the OpenSuse build system to build an RPM that I can later install on the servers, so the actual compiling happens on the OSC servers in a VM where I just tell it which platforms to build for. I commit my changes, my source gets transferred, it builds the RPMs, and then I install. The final product is an Apache module and I'm just doing standalone testing here. So in terms of apples and oranges, yeah, I have no idea if I'm even dealing with fruit. That compiler might be a turbocharged goat, I have no idea. I might be dreaming about gorgonzola when it's clearly brie time, baby.
Was This Post Helpful? 1
  • +
  • -

  • (3 Pages)
  • +
  • 1
  • 2
  • 3