8 Replies - 9404 Views - Last Post: 01 February 2013 - 08:35 AM

#1 Momerath  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 1010
  • View blog
  • Posts: 2,444
  • Joined: 04-October 09

String matching one of many patterns

Posted 28 January 2013 - 04:09 AM

I've been tasked with processing a sort of log file that is constantly being updated. These updates are usually fairly slow (maybe 1-2 second) but has bursts where it may add tens to hundreds of lines in a second.

The issue I think I'll have is that the lines have different formats, over 300 different formats so far (still gathering all the possible outputs) and I need to know which of the hundreds of formats match a specific line so I can extract information from that line. This information needs to be displayed in as close to real time as I can get.

Some examples of the patterns of lines are

<fixed text>
- These lines have specific text that never varies. There are lots of these types of lines.

<fixed text> <variable text>
- in this case the line begins with the same information but has something added on the end that is important.

<variable text> <fixed text> <variable text> <fixed text> <variable text> <fixed text>
- In a case line this I need to know all the variable text information.



Now I know I can use regular expressions to get me the information I need (and to identify which type of line it is). I just think running through hundreds of regular expressions looking for one that matches is not the best way to go about this, but can't think of anything else that would work.

I've done some research and have found methods of checking a single string for multiple match strings, but they all use fixed matching strings. I might be able to use these methods to check for the fixed parts of the lines (no two 'patterns' contains the exact same fixed text, well at least that I've seen) then use a regex to pull the information out of the line once it has been identified.

So the question is does anyone have any experience with this type of thing and how did you go about solving it? Speed and accuracy are the important issues here :) I do want to emphasis that I haven't done any actual testing yet and it may turn out that I'm anticipating an issue that really doesn't exist. I suppose that should be what I do next.

Is This A Good Question/Topic? 0
  • +

Replies To: String matching one of many patterns

#2 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 3569
  • View blog
  • Posts: 11,089
  • Joined: 05-May 12

Re: String matching one of many patterns

Posted 28 January 2013 - 06:42 AM

For the 3rd case, is there a distinguished delimiter between the leading variable text and the fixed text? Or is it simply a space or comma that looks like any other delimiter used for the other lines?

Anyway, on first glance, this seems to be a good application of tries, at least for the strings with leading fixed text.
Was This Post Helpful? 1
  • +
  • -

#3 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 3569
  • View blog
  • Posts: 11,089
  • Joined: 05-May 12

Re: String matching one of many patterns

Posted 28 January 2013 - 07:29 AM

Also take time to read the code for ZipLib or find a high level description of the code. Its compression code has a pretty elegant and fast implementation of a hash and trie that quickly finds matching long strings within its sliding window/dictionary. The approach taken there my help you come up with your own fast implementation.
Was This Post Helpful? 0
  • +
  • -

#4 tlhIn`toq  Icon User is offline

  • Please show what you have already tried when asking a question.
  • member icon

Reputation: 5509
  • View blog
  • Posts: 11,814
  • Joined: 02-June 10

Re: String matching one of many patterns

Posted 28 January 2013 - 09:37 AM

What about something like a code parsing scheme?
We see it right here with the code blocks being colorized and I know there are some projects out there for making parsers for various languages - and they are all about lines of unknown format.

It seems if they can pick which words should be red or blue, they should be adaptable for need like hide or display.
Was This Post Helpful? 0
  • +
  • -

#5 AdamSpeight2008  Icon User is offline

  • MrCupOfT
  • member icon


Reputation: 2262
  • View blog
  • Posts: 9,462
  • Joined: 29-May 08

Re: String matching one of many patterns

Posted 28 January 2013 - 09:51 AM

The code tag doesn't parse the text as code.
It just colorizes it, in particular keywords.

vb.net is especially difficult to highlight correctly.
The character < has (currently) 18 different potential meanings and ( has ~34 different meanings. Then there's the difficult in recognizing strings and comments.
Was This Post Helpful? 0
  • +
  • -

#6 Curtis Rutland  Icon User is online

  • (╯°□°)╯︵ (~ .o.)~
  • member icon


Reputation: 4480
  • View blog
  • Posts: 7,803
  • Joined: 08-June 10

Re: String matching one of many patterns

Posted 28 January 2013 - 10:10 AM

Just a complete shot in the dark, but you might be able to use an existing tool:

http://technet.micro...r/dd919274.aspx
Was This Post Helpful? 3
  • +
  • -

#7 tlhIn`toq  Icon User is offline

  • Please show what you have already tried when asking a question.
  • member icon

Reputation: 5509
  • View blog
  • Posts: 11,814
  • Joined: 02-June 10

Re: String matching one of many patterns

Posted 28 January 2013 - 10:30 AM

That looks cool - or sounds like it has potential. I'm going to dig into that as well for my needs. Thanks!
Was This Post Helpful? 0
  • +
  • -

#8 Momerath  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 1010
  • View blog
  • Posts: 2,444
  • Joined: 04-October 09

Re: String matching one of many patterns

Posted 28 January 2013 - 04:48 PM

View PostCurtis Rutland, on 28 January 2013 - 09:10 AM, said:

Just a complete shot in the dark, but you might be able to use an existing tool:

http://technet.micro...r/dd919274.aspx


Nice tool, but not what I need. I need all input (which is being updated in real time) not a static query on the previous lines to select some of them.
Was This Post Helpful? 0
  • +
  • -

#9 Momerath  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 1010
  • View blog
  • Posts: 2,444
  • Joined: 04-October 09

Re: String matching one of many patterns

Posted 01 February 2013 - 08:35 AM

Just a follow-up:

I did some time testing with about 30 patterns and got the following results:
Regex patterns only : ~600 lines/sec
Static text search followed by Regex (if static text matched) : ~80,000 lines/sec

Given this I've decided that the second method is the way to go. One optimization that I'll be using is the order in which the static text is checked (some types of lines occur rarely so should be the last check, while others are very very common (30-40 patterns make up about 80% of the lines) so they'll be checked first).

I am looking at the Aho–Corasick string matching algorithm as it seems ideal for this situation. +1 to trie suggestion above :)
Was This Post Helpful? 1
  • +
  • -

Page 1 of 1