# Sequence of characters problem

## 28 Replies - 907 Views - Last Post: 15 February 2020 - 07:46 AM

### #16 Skydiver

• Code herder

Reputation: 7239
• Posts: 24,539
• Joined: 05-May 12

## Re: Sequence of characters problem

Posted 12 February 2020 - 01:08 PM

I didn't have a more specific idea beyond edit distance and, in the back of my head, applying some kind of dynamic programming to explore the possible substitutions and come up with the "acceptable strings" at the leaves of the search tree. I only looked up Hamming distance because it is the edit distance calculation that takes only substitutions into account.

Good counterexample above, where minimal Hamming distance is not the desired result.

Perhaps it should be some kind of weighted Hamming distance, where a substitution costs more if the replacement letter differs from its neighbors. So it'll be cheapest when both left and right neighbors are the same as the replacement, and most expensive when both are different. And when the OP's 3-character limit is changed to 4 characters, the weighting also changes to take more neighbors into account.

Just speculating here. I wish the OP would provide a more precise set of rules rather than just showing examples.
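To make the speculation a little more concrete, here is a minimal Python sketch of one way such a weighting could look. Everything in it is an assumption for illustration: the function names, the base cost of 1.0 per substitution, the penalty of up to 1.0 for disagreeing neighbors, and the choice to look at neighbors in the candidate string rather than the observed one. The `radius` parameter stands in for "take more neighbors into account" when the character limit grows.

```python
def substitution_cost(candidate: str, i: int, radius: int = 1) -> float:
    """Cost of a substitution at position i of the candidate string.

    Assumed weighting: a base cost of 1.0, plus the fraction of neighbors
    within `radius` positions that differ from candidate[i]. The result is
    1.0 when all neighbors match the replacement and 2.0 when none do.
    """
    lo = max(0, i - radius)
    hi = min(len(candidate), i + radius + 1)
    neighbors = [candidate[j] for j in range(lo, hi) if j != i]
    if not neighbors:
        return 2.0  # no neighbors at all: treat as the most expensive case
    mismatches = sum(1 for ch in neighbors if ch != candidate[i])
    return 1.0 + mismatches / len(neighbors)


def weighted_hamming(observed: str, candidate: str, radius: int = 1) -> float:
    """Sum the neighbor-weighted cost over every position that was substituted."""
    if len(observed) != len(candidate):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(
        substitution_cost(candidate, i, radius)
        for i, (o, c) in enumerate(zip(observed, candidate))
        if o != c
    )


print(weighted_hamming("AABAA", "AAAAA"))  # replacement blends in with its neighbors
print(weighted_hamming("AABAA", "AACAA"))  # replacement differs from both neighbors
```

With this weighting, smoothing a lone "B" into the surrounding run of "A" is cheaper than swapping it for another isolated letter, which is the ordering described above. Whether that ordering matches the OP's rules is still unknown.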

### #17 Ornstein

Reputation: 32
• Posts: 64
• Joined: 13-May 15

## Re: Sequence of characters problem

Posted 12 February 2020 - 01:36 PM

For poops and giggles I cobbled together a quick implementation in JavaScript which works for all the examples given here so far; it's about 100 lines of (very hacky) code.

Reading back, the OP mentioned earlier that they had already started writing their own algorithm? Maybe they can instead post their progress/struggles with that.

### #18 arturmuller

Reputation: 0
• Posts: 10
• Joined: 11-February 20

## Re: Sequence of characters problem

Posted 12 February 2020 - 02:08 PM

Hi,

I managed to visualize this a little better by plotting a graph of my data. I am working with ML.NET and would really appreciate it if I could also do the correction with it.

Here are two pictures: the first is a visualization of my data, and the second is the expected "correction"/result. The teal-marked areas are segments without any characters. I need a setting for how wide a segment has to be to qualify; for example, the yellow spike qualifies by a small margin.

https://imgur.com/a/HruSxTx

Like I said, can I achieve the calculation with ML.NET?

Best regards
Artur

### #19 modi123_1

• Suitor #2

Reputation: 15497
• Posts: 62,056
• Joined: 12-June 08

## Re: Sequence of characters problem

Posted 12 February 2020 - 02:13 PM

### #20 arturmuller

Reputation: 0
• Posts: 10
• Joined: 11-February 20

## Re: Sequence of characters problem

Posted 12 February 2020 - 03:10 PM

Well, I have been trying different ML.NET algorithms and none of them seems to work for this. One other thing is that this needs to run on .NET Core/.NET Standard, and it seems like some of these ML algorithms need a Windows environment.

It would be good if someone could point me to an algorithm that can make the kind of correction shown in my picture.

### #21 modi123_1

• Suitor #2

Reputation: 15497
• Posts: 62,056
• Joined: 12-June 08

## Re: Sequence of characters problem

Posted 12 February 2020 - 03:13 PM

Which have you tried?

It sounds like you are just throwing random ML concepts at this and praying for some sort of result, which is not good.

### #22 arturmuller

Reputation: 0
• Posts: 10
• Joined: 11-February 20

## Re: Sequence of characters problem

Posted 12 February 2020 - 03:51 PM

Well it does not necessarily need to be ML. Can you propose some other algorithm?

### #23 Skydiver

• Code herder

Reputation: 7239
• Posts: 24,539
• Joined: 05-May 12

## Re: Sequence of characters problem

Posted 12 February 2020 - 06:34 PM

I'm sorry for being so dense, but how is this visualization supposed to be interpreted?

Is the Y-axis (labeled "Char") of 0 to 25 representative of the letters A-Z?

What does the X-axis (labeled "Sample") that ranges from 0 to 1600 represent? Is that the time that a particular letter was seen? Is that in milliseconds?

What are the vertical bars? In particular what do the left and right edges of the vertical bars mean?
For example: There looks to be a single bar with a height of 6 running from X:600 to X:700. Does that mean letter 'G' was seen from time 600 to time 700? What about the 4 relatively thick bars and a single thin bar also with height 6, but spanning the time 0 to 100? Does that mean the letter 'G' was seen 5 times -- 4 times for relatively long periods, as well as 1 quick flash? What does the lack of a bar between time 270 to 280 mean? Does that mean the letter 'A' was seen during that period?

### #24 arturmuller

Reputation: 0
• Posts: 10
• Joined: 11-February 20

## Re: Sequence of characters problem

Posted 13 February 2020 - 02:49 AM

Is the Y-axis (labeled "Char") of 0 to 25 representative of the letters A-Z? Yes.

What does the X-axis (labeled "Sample") that ranges from 0 to 1600 represent? Is that the time that a particular letter was seen? Is that in milliseconds? Yes, this is a time series showing when each letter was received, in seconds.

What are the vertical bars? In particular what do the left and right edges of the vertical bars mean? The left and right edges mark a transition between characters, a "break"; this is part of what I am trying to identify.

For example: There looks to be a single bar with a height of 6 running from X:600 to X:700. Does that mean letter 'G' was seen from time 600 to time 700? That means there is noise in this character sequence. Everything from X:500 to approximately X:1350 should be the same character, but due to noise/sampling errors we see these bars/holes.

What about the 4 relatively thick bars and a single thin bar also with height 6, but spanning the time 0 to 100? Does that mean the letter 'G' was seen 5 times -- 4 times for relatively long periods, as well as 1 quick flash? This is the same as above: G should run from X:0 to approximately X:250, but due to noise etc. the bars get split up.

What does the lack of a bar between time 270 to 280 mean? Does that mean the letter 'A' was seen during that period? This means that no character was seen during this period, and that gap should be identified as well.

All we see in the graph are bars of characters appearing over a certain time span, and they get split up due to noise/sampling errors.

Thank you for helping out !

Best regards
Artur

### #25 Skydiver

• Code herder

Reputation: 7239
• Posts: 24,539
• Joined: 05-May 12

## Re: Sequence of characters problem

Posted 13 February 2020 - 05:23 AM

So is Y:0 == A, or is Y:0 == no signal? If the latter, why did you say the range 0 to 25 corresponded to A to Z?

Also, in the original problem, are the letter sequences each sample, or are they quantized? For that time 0-100 is that 5 G's or just a single G?

### #26 arturmuller

Reputation: 0
• Posts: 10
• Joined: 11-February 20

## Re: Sequence of characters problem

Posted 13 February 2020 - 11:56 AM

So is Y:0 == A, or is Y:0 == no signal? Y==0 is no signal, Y==1 is A.

If the latter, why did you say the range 0 to 25 corresponded to A to Z? I just plotted to 25 since the higher characters were not present at all in this sample.

Also, in the original problem, are the letter sequences each sample, or are they quantized? For that time 0-100 is that 5 G's or just a single G? They are quantized.
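For readers following along, the encoding described here (0 is no signal, 1 is A, and so on) can be made concrete with a short Python sketch that collapses a sample series into runs, which is roughly what the bars in the plot show. The function names, the `-` placeholder for "no signal", and the sample data are made up for illustration; none of them come from the OP's data.

```python
from itertools import groupby


def decode(sample: int) -> str:
    """Map a sample value to a symbol: 0 -> '-' (no signal), 1 -> 'A', 2 -> 'B', ..."""
    if sample == 0:
        return "-"
    return chr(ord("A") + sample - 1)


def run_lengths(samples):
    """Collapse consecutive equal samples into (symbol, length) pairs."""
    return [(decode(value), len(list(group))) for value, group in groupby(samples)]


# Hypothetical series: a 'G' run with a one-sample dropout, a wider gap, then 'A'.
samples = [7, 7, 7, 0, 7, 7, 0, 0, 0, 0, 1, 1, 1]
print(run_lengths(samples))
# [('G', 3), ('-', 1), ('G', 2), ('-', 4), ('A', 3)]
```

Seen this way, the task is deciding which runs are real segments (including real gaps) and which are noise splitting a longer segment.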

Any tips for an algorithm?

### #27 Skydiver

• Code herder

Reputation: 7239
• Posts: 24,539
• Joined: 05-May 12

## Re: Sequence of characters problem

Posted 13 February 2020 - 07:11 PM

Why don't you share what you've currently got?

It looks like Ornstein has come up with some code to tackle the problem. I don't know what algorithm he has implemented.

As for me, I was originally considering playing with edit distances. After seeing your visualization, I may try some noise filtering/reduction algorithms that try to preserve edges. It seems to me that your magic limit of 3 characters (or whatever you set as the limit) might be a good edge indicator.
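As a rough illustration of that idea, here is a Python sketch of one simple edge-preserving heuristic: treat any run shorter than the limit as noise and absorb it into the longer of its neighboring runs, leaving runs at or above the limit (and the edges between them) untouched. This is not Ornstein's implementation and not a standard named filter; the function name `smooth`, the `min_run` parameter, and the tie-breaking rule are assumptions. Samples use the encoding from earlier in the thread (0 is no signal, 1 is A).

```python
from itertools import groupby


def smooth(samples, min_run=3):
    """Absorb every run shorter than `min_run` into a neighboring run."""
    runs = [[value, len(list(group))] for value, group in groupby(samples)]
    while len(runs) > 1:
        # Pick the shortest run; stop once every run is long enough.
        i = min(range(len(runs)), key=lambda k: runs[k][1])
        if runs[i][1] >= min_run:
            break
        left = runs[i - 1] if i > 0 else None
        right = runs[i + 1] if i + 1 < len(runs) else None
        # Give the short run's samples to the longer neighbor (left wins ties).
        if right is None or (left is not None and left[1] >= right[1]):
            left[1] += runs[i][1]
        else:
            right[1] += runs[i][1]
        del runs[i]
        # Neighbors that now carry the same value become a single run.
        merged = []
        for value, length in runs:
            if merged and merged[-1][0] == value:
                merged[-1][1] += length
            else:
                merged.append([value, length])
        runs = merged
    # Expand the runs back into a flat series of the original length.
    return [value for value, length in runs for _ in range(length)]


noisy = [7, 7, 7, 0, 7, 7, 7, 7, 0, 0, 0, 0, 1, 1, 1, 1]
print(smooth(noisy, min_run=3))
# [7, 7, 7, 7, 7, 7, 7, 7, 0, 0, 0, 0, 1, 1, 1, 1]
```

In the example, the one-sample dropout inside the run of 7s is filled in, while the four-sample gap survives because it meets the width limit. Raising `min_run` makes the filter more aggressive, which corresponds to the OP's setting for how wide a segment has to be to qualify.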

### #28 arturmuller

Reputation: 0
• Posts: 10
• Joined: 11-February 20

## Re: Sequence of characters problem

Posted 14 February 2020 - 11:18 AM

Can anyone suggest such a noise filtering algorithm?

### #29 Skydiver

• Code herder

Reputation: 7239
• Posts: 24,539
• Joined: 05-May 12

## Re: Sequence of characters problem

Posted 15 February 2020 - 07:46 AM