Sequence of characters problem

  • (2 Pages)
  • +
  • 1
  • 2

28 Replies - 903 Views - Last Post: 15 February 2020 - 07:46 AM

#1 arturmuller   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 10
  • Joined: 11-February 20

Sequence of characters problem

Posted 11 February 2020 - 02:21 PM

I have a sequence of characters. The characters are ordered in chronological order. I am looking for an algorithm to group the sequence and remove errors in the data. I am having an hard time to explain in words/maths what the requirement of fixing the sequence is but something like "the outcome is to group as many characters as possible in a chain of constant letters" and minimum sequence length should be a setting. Below I am trying to visualize by some examples what I mean (3 chars minimum):

AAACCAACA => AAAAAAAAA

ACACAAAAA => AAAAAAAAA

AYYYYYYYA => YYYYYYYYY

YYYYAAYYY => YYYYYYYYY

AYAYAYYYY => YYYYYYYYY

And for longer sequences it becomes a little more tricky

AYAYAYAYAAAAAAAAAAYAYYYYYYYYYYYYYY => AAAAAAAAAAAAAAAAAAYYYYYYYYYYYYYYYY

HTHTTHHHTTHHH => TTTTTHHHHHHHH

TTOAOAOOAATTA => OOOOOOOOAAAAA

This is bit tricky because even thought the 3 O's aren't directly next to each other, they are still the most probable uninterrupted sequence of characters.

Does anyone know of any algorithm (Machine learning?) or similar (term of this problem, problem name) that can do this type of error-code correction?

Thank you in advance

Is This A Good Question/Topic? 0
  • +

Replies To: Sequence of characters problem

#2 modi123_1   User is offline

  • Suitor #2
  • member icon



Reputation: 15497
  • View blog
  • Posts: 62,056
  • Joined: 12-June 08

Re: Sequence of characters problem

Posted 11 February 2020 - 02:33 PM

Quote

AAACCAACA => AAAAAAAAA

What happened to the C's? Why are there more A's now? How is this 'corrected'?
Was This Post Helpful? 0
  • +
  • -

#3 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 11 February 2020 - 03:21 PM

Also, how is this a C# question? This currently looks like a generic error correction problem that can be solved using any language.
Was This Post Helpful? 0
  • +
  • -

#4 modi123_1   User is offline

  • Suitor #2
  • member icon



Reputation: 15497
  • View blog
  • Posts: 62,056
  • Joined: 12-June 08

Re: Sequence of characters problem

Posted 11 February 2020 - 03:36 PM

Arty was here as well:
https://stackoverflo...pikes-in-signal
Was This Post Helpful? 1
  • +
  • -

#5 arturmuller   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 10
  • Joined: 11-February 20

Re: Sequence of characters problem

Posted 11 February 2020 - 04:43 PM

Thanks for your reply modi123_1.

This is corrected because we have 3 consecutive A's first and the C's never managed to "break" the chain of A's. This is something we only know after receiving the full set of characters. What would be considered a break should be a configurable number, in my example I set 3.
Was This Post Helpful? 0
  • +
  • -

#6 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 11 February 2020 - 05:45 PM

HTHTTHHHTTHHH => TTTTTHHHHHHHH


Can you explain this result?

I can see how you got the trailing H's due to:
HTHTTHHHTTHHH => TTTTTHHHHHHHH
where following your rule above that the 3 consecutive H's setting the trend.

But I don't understand why you would get this?
HTHTTHHHTTHHH => TTTTTHHHHHHHH
Is it because the T's have a bigger population than H's?

If you say that triple letter trends only go left to right, but not right to left, then why do the Y's move to the left in this case where the A's are in majority?
AYAYAYYYY => YYYYYYYYY
Was This Post Helpful? 0
  • +
  • -

#7 arturmuller   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 10
  • Joined: 11-February 20

Re: Sequence of characters problem

Posted 11 February 2020 - 06:44 PM

@Skydiver, yes on second thought you are right on both your statements/questions. Your results is correct of what I am trying to figure out.
Was This Post Helpful? 0
  • +
  • -

#8 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 11 February 2020 - 07:54 PM

I think that you need to articulate your transformation rules instead of showing examples because your current examples don't seem to show a consistent pattern.
Was This Post Helpful? 0
  • +
  • -

#9 Ornstein   User is offline

  • D.I.C Head

Reputation: 32
  • View blog
  • Posts: 64
  • Joined: 13-May 15

Re: Sequence of characters problem

Posted 12 February 2020 - 05:08 AM

As others have said, you might want to be more specific about your requirements (e.g. is this some assignment/challenge - or something you've dreamed up in your own mind?)

If you want a general solution, some AI approach would most likely work. You might benefit from reading up on unsupervised learning or unsupervised anomaly detection - or just anomaly detection in general.

(In your case, if you did need to train some learning algorithm, it might be a problem if you're making the rules up on-the-fly. :P This approach would probably also be overkill compared to any other method you might be able to use if you could be more specific about the problem.)

If you're generally trying to design or imagine a system which corrects errors in a signal, you can use any number of established methods.

If you yourself don't have a definite idea of what would be a "100% correct" output for your given examples, you might write an algorithm which assigns probabilities to different outputs - and then you can view all possible outputs and/or choose the output with the highest assigned probability (e.g. Given your HTHTTHHHTTHHH example above, you might assign "TTTTTHHHHHHHH" a probability of 51% and "HHHHHHHHHHHHH" a probability of 49%.)

As for the apparent conflict between your HTHTTHHHTTHHH and AYAYAYYYY examples: you might factor in the Ts at positions 9 and 10 (the last two Ts in the sequence) to add weight to the presence of Ts at the beginning of the sequence - or maybe the fact that the two Ts at positions 4 and 5 are next to each other adds weight to that counting as a sequence, etc.
Was This Post Helpful? 0
  • +
  • -

#10 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 12 February 2020 - 07:20 AM

Also, another question about the problem domain? Will there only ever be two types of symbols in the input string? Or is it possible to have more than two symbols? e.g. Is this allowed as input:
AAABBBCCC


Was This Post Helpful? 0
  • +
  • -

#11 arturmuller   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 10
  • Joined: 11-February 20

Re: Sequence of characters problem

Posted 12 February 2020 - 09:33 AM

@Skydiver, yes there can be any character of the latin alphabet. The problem with defining the rules is, as you are very correct in your previous statement. This type of scenario doesnt fall in place with it:

ATTAATAATTTTTTTTTTTT => AAAAAAAATTTTTTTTTTTT

Because the A's are never fully "interupted" and the sequences of A's after removing the T's still gets long (more than 3 chars as well). This is tricky because the algorithm needs to "look ahead" when its processing characters left-to-right.

I am trying to design my own algorithm for this but should be something already I could learn from.

Thank you very much for your help
Was This Post Helpful? 0
  • +
  • -

#12 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 12 February 2020 - 10:37 AM

I suggest presenting your transforms in code tags and as two separate lines to make it easier to see the diff between the input and output: So for your post #11, it would be:
ATTAATAATTTTTTTTTTTT =>
AAAAAAAATTTTTTTTTTTT



and from post #1:
AAACCAACA =>
AAAAAAAAA

ACACAAAAA =>
AAAAAAAAA

AYYYYYYYA =>
YYYYYYYYY

YYYYAAYYY =>
YYYYYYYYY

AYAYAYYYY =>
YYYYYYYYY

AYAYAYAYAAAAAAAAAAYAYYYYYYYYYYYYYY =>
AAAAAAAAAAAAAAAAAAYYYYYYYYYYYYYYYY

HTHTTHHHTTHHH =>
TTTTTHHHHHHHH

TTOAOAOOAATTA =>
OOOOOOOOAAAAA


Was This Post Helpful? 0
  • +
  • -

#13 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 12 February 2020 - 10:52 AM

I'm not seeing anything C# specific here. I'm moving over into Computer Science forum.
Was This Post Helpful? 0
  • +
  • -

#14 Skydiver   User is offline

  • Code herder
  • member icon

Reputation: 7239
  • View blog
  • Posts: 24,539
  • Joined: 05-May 12

Re: Sequence of characters problem

Posted 12 February 2020 - 11:01 AM

I wonder if you can take some inspiration from looking at some of the edit distance algorithms and trying to minimize the hamming distance between your input, and some subset of acceptable resulting strings. The string with the lowest hamming distance must be the correct answer.
Was This Post Helpful? 0
  • +
  • -

#15 Ornstein   User is offline

  • D.I.C Head

Reputation: 32
  • View blog
  • Posts: 64
  • Joined: 13-May 15

Re: Sequence of characters problem

Posted 12 February 2020 - 12:22 PM

Did you have a more specific idea of how exactly you might use the Hamming distance and/or how you'd establish the "subset of acceptable resulting strings" to test against?

Take one example given by the OP:

AYAYAYYYY
---------
YYYYYYYYY | 3
AAAAAYYYY | 2


This illustrates the input, two possible outputs and their respective Hamming distances. According to the OP, the "correct" output has the higher Hamming distance here.

Unless the OP is willing to compromise on what exactly they consider to be the right and wrong answer? Are the examples above set in stone?

So far, the only way I can make sense of the OP's examples is if (as I alluded to earlier) you were to place some significance on two of the three letters in the sequence (especially at the beginning?) being adjacent. If that's sufficient, you could probably write an algorithm (a pretty simple one) to do what the OP is asking. I just have a feeling that the OP would still manage to find some convoluted new example where this algorithm wouldn't work. :P
Was This Post Helpful? 1
  • +
  • -

  • (2 Pages)
  • +
  • 1
  • 2