First of all, I want to make it clear that I do understand the counting argument.

However, it's clear there are simple algorithms (mathematical functions, I should say) that have the potential to output a relatively large file. One such example is the famous zip bomb, but that's not useful at all (the opposite, actually).

What am I trying to say? Well, suppose that amongst all existing MP3 files there happens to be one whose bits, interpreted as a number, equal 97^5049 + 3 (completely arbitrary example). Then there is a way to create a small piece of software that computes 97^5049 + 3 and reconstructs the original file (note that I'm talking specifically about this one file).
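For concreteness, here is a minimal Python sketch of such a "decompressor". The expression is essentially the entire stored program, yet its output is thousands of bytes (the big-endian byte interpretation is my assumption):

```python
# Hypothetical decompressor for a file whose bits, read as a number,
# equal 97**5049 + 3. The expression below is essentially the whole
# stored program, yet it expands to thousands of bytes.
n = 97**5049 + 3

# Interpret the number as bytes, most significant byte first (assumed).
data = n.to_bytes((n.bit_length() + 7) // 8, "big")

print(len(data), "bytes reconstructed")
assert int.from_bytes(data, "big") == n   # lossless round trip
```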

I'm aware of the formal definition of Kolmogorov complexity. It is, however, uncomputable. What I want in this thread is a nice discussion of practical approaches to extreme data compression "techniques" such as the one mentioned above.

I've thought of working backwards (given a number that represents a file, or part of it, try to find the function), but I haven't implemented it yet. What I'm trying at the moment is generating a large binary string with a simple function and "hoping" that it matches the bit string that represents the file, or a substring of it. Again, of course it's impossible to come up with an algorithm that compresses every file. I'm using the famous Lena image in my tests; my approach is to search for a large power of a prime number whose binary representation is a reasonably long substring of the bit string that represents the file. There probably isn't one, and it will take a long time to find one if there is, but, once found, it's easy to "decompress": exponentiation is computationally cheap.
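A rough sketch of that forward search might look like this (the prime list, exponent range, and minimum match length are arbitrary choices of mine, and the demo plants a match so that one exists):

```python
# Brute-force search sketch: look for a power of a prime whose binary
# representation appears as a substring of the file's bit string.

def bits_of_file(data: bytes) -> str:
    return "".join(f"{byte:08b}" for byte in data)

def find_prime_power_match(file_bits: str, primes, max_exp, min_len=16):
    for p in primes:
        for k in range(2, max_exp):
            candidate = bin(p ** k)[2:]        # binary digits of p**k
            if len(candidate) < min_len:
                continue
            idx = file_bits.find(candidate)
            if idx != -1:
                return p, k, idx               # enough to regenerate the substring
    return None

# Toy demonstration: plant bin(3**50) inside otherwise arbitrary bits.
planted = bin(3 ** 50)[2:]
file_bits = "1010" * 10 + planted + "0110" * 10
print(find_prime_power_match(file_bits, [3, 5, 7], 60))
```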

Also, the German .kkrieger game is very intriguing; the developers used procedural generation and managed to fit a 3D FPS game into 97 KB. Notably, though, that's not data compression; it's data generation.

I'd like some general information about data compression techniques, and if any relates to what I exposed above.

But frankly, a good discussion would be awesome. Any idea is welcome.

## 16 Replies - 856 Views - Last Post: 26 June 2019 - 05:02 AM

## Replies To: Extreme data compression with specific data

### #2

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 07:00 AM

This sounds like you want to hash a file into a single number, then reverse that hash into the full file?

### #3

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 08:02 AM

Kind of like that. What I propose is: given a specific file, interpret the bits that make it up as a number and try to find a function that generates it while taking up little space on disk (ideally the smallest such program, but Kolmogorov complexity is uncomputable, so the true minimum can't be found in general).

I've come up with some ideas already; if we consider a substring of the bit string that represents the number (i.e. the file, whatever), and it happens to be the output of a function which can be expressed using less space, there is room for compression.

Another approach I thought of was raising prime numbers to powers (so that no repeated bitstrings are generated, plus perhaps a boolean telling whether to flip the bits, given that the binary representation of any number starts with 1) and trying to match this binary representation with a part of the file (or XORing it with every substring of the file to measure the discrepancy, and adjusting accordingly afterwards).

Is there any theory that states that for any specific file there is a way to find a function that generates it and occupies less space? (or that there isn't, for every file)

Has anyone ever tried something similar? The discussion is open!

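A minimal sketch of the XOR-and-count idea (toy sizes; the real thing would slide a much longer generated pattern over a much larger file):

```python
# Slide a generated bit pattern over the file's bit string and count
# mismatches at each offset; a low (or, if we may flip all bits, very
# high) mismatch count marks a promising position.

def mismatches_at_each_offset(file_bits: str, pattern: str):
    L = len(pattern)
    results = []
    for i in range(len(file_bits) - L + 1):
        window = file_bits[i:i + L]
        # XOR of two bit strings: count the positions where they differ.
        diff = sum(a != b for a, b in zip(window, pattern))
        results.append((i, diff))
    return results

file_bits = "1110010111010001"
pattern = "0101"
best = min(mismatches_at_each_offset(file_bits, pattern), key=lambda t: t[1])
print(best)   # (4, 0): an exact match at offset 4
```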

### #4

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 08:19 AM

I don't think there is a generic solution for this. It sounds like you want to apply something similar to the way wavelets are applied to audio: wavelets are fitted against the waveform, and it's the characteristics of the fitting waves that are stored. In this case, though, it seems like you are trying to find a function that models the entire file, or as much of the file as possible.

### #5

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 11:10 AM

Yes, that's kind of what I'm proposing. A better formulation would be: given a fixed binary sequence (that is, we take any file in the world, or we randomly generate one), how could we find an algorithm (or function) that outputs the sequence (there are infinitely many), and how could we find one that requires less space on disk? (Kolmogorov complexity theory says finding the shortest is impossible, but maybe some heuristic works well in practice.)

### #6

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 04:55 PM

If you don't mind lossy compression, treat the source file as the input wave form and simply apply the same algorithm used to create MP3 files.

It sounds like you want lossless compression, though.


### #7

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 05:06 PM

Cross posted in StackExchange.

### #8

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 06:15 PM

Yes, Skydiver, I posted there as well. I am unsure whether that is a problem, but I can easily solve it by deleting this thread (or a moderator can).

I am referring to lossless compression of specific data. One can easily program a piece of software that has a seemingly "crazy" function in it, feed in a number through it, and receive a gigantic number as output. The binary representation of this output can be stored in a file, apparently random.

What I'm curious about (subject of this question/discussion) is ways to reverse this process. That is, given the binary representation of a file (which describes a number, usually gigantic), work backwards on it and find a function that generates it (or at least a considerable part of it). If that is achievable, although probably through a very long process, then it's possible to easily generate the original file via this function, and this function can be very compact on disk.

What are the theories behind this? And is my reasoning flawed? If so, how?

Thanks everyone for reading.


### #9

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 07:31 PM

We just usually leave notices here about cross-postings so that other people can decide whether it is worth their time and effort to respond here, if somebody is already getting help elsewhere.

As you probably know, in data compression the first step is to create a model of the data to be compressed. That data modeling can take different forms: statistical, analytical, self-similarities, etc.

I think what you are trying to achieve is to do some curve fitting on the data in the hopes of finding a function and then just encoding the coefficients and exponents of that function.

I can give you a counterexample that shows this is impossible in general: if the file contains completely random data (note: not pseudo-random) from some natural phenomenon like nuclear decay or sunspots, then by definition there is no function which will replicate that data. If there were a way to replicate the data, then it would not be random.

Another approach you could take, instead of trying to find a function, is to treat the entire file as a huge N-bit number. If you can factor that huge N-bit number into a small set of x-bit numbers where x < N / 2, then you will be on to something. Of course, if you figure out how to factor huge N-bit numbers with current hardware, you'll likely have your name published in every crypto publication. If you have a quantum computer, then Shor's algorithm will be your ticket to getting those factors.

But why limit yourself to just factoring? You could try to see if the bit patterns within the file match up with the bit patterns of transcendental numbers like pi. Then this just becomes a simple pattern-matching problem. You would likely burn a few oceans doing this, though, and still not come up with a match; then you'd have to cut the file into smaller blocks and repeat the process of trying to find matches.

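To make the factoring trade-off concrete, here is a small sketch (naive trial division, toy numbers of my choosing): a "smooth" number with tiny prime factors compresses enormously as (prime, exponent) pairs, while a product of two large primes saves essentially nothing, since the factors' bit lengths add back up to roughly N.

```python
# Encode a number as its prime factorization (prime, exponent) pairs
# and compare the storage cost against the number's own bit length.

def trial_factorize(n: int):
    """Naive trial division; fine for a demo, hopeless for real file sizes."""
    factors = []
    d = 2
    while d * d <= n:
        exp = 0
        while n % d == 0:
            n //= d
            exp += 1
        if exp:
            factors.append((d, exp))
        d += 1
    if n > 1:
        factors.append((n, 1))
    return factors

smooth = 2 ** 300 * 3 ** 7         # a 312-bit number with a tiny description
print(smooth.bit_length(), trial_factorize(smooth))

semiprime = 1000003 * 1000033      # product of two ~20-bit primes
factor_bits = sum(p.bit_length() for p, _ in trial_factorize(semiprime))
print(semiprime.bit_length(), factor_bits)   # factor bits add up to N again
```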

### #10

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 07:52 PM

I would think you would want to start with the forward process before getting to the backward one. Hand-waving the "binary representation of a file" is not a solid step. How is this representation created? Does it guarantee uniqueness? Does it guarantee something smaller than the actual bits that make up the file? What are the rules, going forward, for creating this representation?

Not knowing that, then yes? No? The answer is up in the air.


### #11

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 08:35 PM

Skydiver, what is this definition of random? I've never come across it. I mean, of course there is no function that generates a random value; a function is, by definition, deterministic (each input maps to a single output). And any lossless general-purpose compression scheme must be injective (hence the counting argument). Okay, but I'm talking about one specific file. Regardless of the source of its bits, once the bits are defined they stay the same way forever.

Mathematically speaking, we know there are infinitely many functions that map to a specific number. This number represents our specific file. Our goal is to find a function that generates this number with an acceptable trade-off between execution speed and size.

modi123_1, I'm currently trying the forward process (even though the backwards process seems more interesting to me, it's also more complex). I'm iterating over powers of prime numbers and checking whether the bits of each power match some substring of the file. So far I've managed to find only 34 matching bits... But I intend to incorporate an error-correction technique: analyse the ratio of correct bits our generated string gets when positioned at some index of the binary string that represents the file. If the ratio is "good" (e.g. the generated bitstring coincides with all but 2 bits of a particular substring of the file, and its length is 5000), then we can just store the index, the prime, the power, and the corrections (to flip the wrong bits; each correction takes log2(lengthOfSubstring) bits to represent), and we're able to recreate that substring.

Also, as the binary representation of any number always starts with 1, I've thought of a boolean that tells us whether to flip all the bits of the generated substring. As every number has only one prime factorization, every prime+power pair will generate a unique binary substring (and, as a consequence, its flipped version will also be unique).

Another important thing to mention: as we can now flip all the bits of the substring if we wish, the maximum number of errors we can make is floor(L/2), L being the length of the substring (if we make more than L/2 errors, say we only get 4 bits right, we can just flip every bit and get all but 4 bits right).

Mathematically speaking, each correction is actually adding a power of 2 to, or subtracting one from, the prime power. When the bit being corrected is 1 at index n, we're subtracting 2^n, and when it's 0 we're adding 2^n.

Theoretically speaking, can this yield some result? (I mean, some result it will surely yield; assume that "some result" means some good result)

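A quick back-of-the-envelope check of that encoding's break-even point, under parameter widths I'm assuming (the prime, exponent, match index, and flip flag all have to be stored, plus log2(L) bits per correction). The example uses prime 101 with exponent 751, since 101^751 is roughly 5000 bits long:

```python
import math

# Total bits needed to encode one matched substring with this scheme.
def encoded_size_bits(prime, exponent, file_len_bits, substring_len, n_corrections):
    cost = prime.bit_length()                       # the prime itself
    cost += exponent.bit_length()                   # the exponent
    cost += math.ceil(math.log2(file_len_bits))     # index of the match in the file
    cost += 1                                       # flip-all-bits flag
    cost += n_corrections * math.ceil(math.log2(substring_len))  # each correction
    return cost

# The example from the post: a 5000-bit substring matched with 2 errors,
# inside a hypothetical 2-megabit file.
cost = encoded_size_bits(prime=101, exponent=751,
                         file_len_bits=2_000_000,
                         substring_len=5000, n_corrections=2)
print(cost, "bits to encode a 5000-bit substring")   # 65 bits: a huge win
```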

### #12

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 09:12 PM

Yes, I understood that once the bits in the file were defined, they were defined forever. What I was trying to say was that the bits in the file came from a real random noise source. Imagine a recording of cosmic noise, or sunspots, or a couple hundred lava lamps.

### #13

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 09:29 PM

Since you are considering bit patterns from powers of primes, and their negatives when you flip all the bits, why are you not considering bit patterns in reverse (e.g. from least significant bit to most significant bit, instead of just most significant to least significant)?

When you mentioned that you were working specifically with the "Lena" image, I assume that you are trying to compress the raw pixel data, and not a pre-processed file that has already been RLE'd or pre-compressed in some other way. If you are working with the raw image, have you also considered searching for bit patterns that may match going down (or up) the pixel columns instead of just the traditional left to right scanlines? Or have you also tried to match on a diagonal pattern like the way some image compression algorithms work?

Any which way, you are correct that you need to find a "good" ratio of matching bits vs. bits needed to encode the prime and associated parameters. I'm just very doubtful that this approach will lead to something that is performant enough to run in a reasonable amount of time.

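For instance, switching the raw pixel grid's scan order before bit matching could look like this (a hypothetical tiny grayscale image stands in for the real 512x512 pixel data):

```python
# Produce the file's bit string in different scan orders. `image` is a
# hypothetical grayscale image given as rows of byte values.

def row_major_bits(image):
    """Traditional left-to-right scanlines, top to bottom."""
    return "".join(f"{px:08b}" for row in image for px in row)

def column_major_bits(image):
    """Down each pixel column in turn, left to right."""
    h, w = len(image), len(image[0])
    return "".join(f"{image[r][c]:08b}" for c in range(w) for r in range(h))

image = [[255, 0],
         [1, 128]]
print(row_major_bits(image))     # bytes in order 255, 0, 1, 128
print(column_major_bits(image))  # bytes in order 255, 1, 0, 128
```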

### #14

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 09:42 PM

My test file is Lena's raw TIFF image (the uncompressed one), but it could be any other. I'm not particularly interested in searching for patterns (e.g. RLE, Huffman coding, etc.).

I still don't get the difference, given my approach, between a randomly generated file and any other file (since randomness can generate any file, however unlikely a specific one is).

Regarding the performance, I agree it's too slow to compress. But it's very straightforward to decompress: raising a prime to a given power takes an asymptotically logarithmic number of multiplications, and the corrections are cheap as well.

I'm not considering the reversed sequence because maybe there is some other sequence (also a power of a prime) whose bit representation happens to be the reverse.

However, any approach is valid; hence my feeling that working backwards may be better.

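That logarithmic cost is just square-and-multiply exponentiation; a sketch:

```python
# Square-and-multiply: computing base**exp uses O(log exp) big-number
# multiplications, which is why regenerating a prime power is cheap.

def fast_pow(base: int, exp: int) -> int:
    result = 1
    while exp:
        if exp & 1:           # multiply in the current power when this bit of exp is set
            result *= base
        base *= base          # square once per binary digit of exp
        exp >>= 1
    return result

print(fast_pow(2, 10))                    # 1024
assert fast_pow(97, 5049) == 97 ** 5049   # matches Python's built-in power
```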

### #15

## Re: Extreme data compression with specific data

Posted 25 June 2019 - 09:49 PM

modi123_1, the binary representation of a file is what you get if you expand the bytes that make up the file. Ex: FF000100 would give us the binary sequence 11111111000000000000000100000000. And yes, this does guarantee uniqueness (the bits of a file are what define the file).

I am unsure whether every specific file can be represented algorithmically with fewer bits through a function (we have to count the number of bits that make up the executable, i.e. everything tied to the Turing machine used). That is one of the reasons I started this thread.

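The expansion just described, and its inverse, in a few lines (assuming the bit string's length is a multiple of 8):

```python
# Expand a file's bytes into its bit string, and invert the process.

def bytes_to_bits(data: bytes) -> str:
    return "".join(f"{b:08b}" for b in data)

def bits_to_bytes(bits: str) -> bytes:
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

data = bytes.fromhex("FF000100")
bits = bytes_to_bits(data)
print(bits)                        # 11111111000000000000000100000000
assert bits_to_bytes(bits) == data # round trip is exact
```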