
How to compress huge but almost identical files?


I have 3 files that are 30GB each. The files themselves contain relatively high-entropy data and thus aren't easily compressible beyond 75%. However, each file is at least 95% identical to the others; only a couple of GB differ between them.

What are my expectations?

I expect it to be possible to compress these 3 files to about 30GB in total, or less. This seems very realistic to me, since I know the first ~29GB is identical between all of them. In theory it should then be possible to simply ignore that shared part of the other two files before feeding them into a compression algorithm. That would leave roughly 32GB of unique data to compress, which should give me an archive smaller than a single 30GB file.

Why doesn't it work with regular solid compression then?

My hypothesis is that the dictionary is limited to about 512MB, which is the largest I can use with the amount of RAM at my disposal.

The best I could do was to compress the files on the highest settings and end up with a ~70GB archive. That is nowhere near as small as I want it to get, but I am not surprised by the result given my understanding of the compression algorithms in WinRAR and 7zip.

Using solid compression I hoped to have more success. To my understanding, solid compression (not just a solid block size, but fully solid) means the files are concatenated in the archive like a tarball, so they are treated as one single data stream. The downside is that the entire archive has to be read to extract a single file. The supposed upside is that data shared between the files is compressed away because it is seen as a duplicate.
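
Concretely, this is roughly the kind of invocation I tried (wrapped in a small Python script here for readability; the file names are placeholders and I'm assuming the 7z command-line executable is on the PATH):

    # Rough sketch of the 7-Zip call I used (file names are placeholders).
    import subprocess

    files = ["image1.img", "image2.img", "image3.img"]  # placeholder names

    subprocess.run(
        ["7z", "a",      # add files to an archive
         "-t7z",         # 7z container format
         "-mx=9",        # maximum compression level
         "-md=512m",     # LZMA dictionary size, capped by my available RAM
         "-ms=on",       # solid mode: treat all files as one data stream
         "images.7z"] + files,
        check=True,
    )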

The reason I think solid compression isn't working on my 30GB files is that the dictionary gets saturated and the compressor has to start a new one. Once it gets to the next file it doesn't appear to check the earlier dictionary anymore; it continues with the latest dictionary, which no longer contains the duplicate data from the first section of the first file. Of course it would take far longer to check all dictionaries, so solid compression clearly isn't the way to go here because of the memory restriction. Perhaps there is a compression algorithm that, instead of storing the actual data in a dictionary, stores a pointer into the already-seen data and compares the data at that pointer against newly read data to decide whether a repeat starts, or even does this separately for each file that gets added, for much better compression of multiple near-identical files.

I am looking for a compression algorithm that doesn't compress each file separately or as one plain data stream, but one that expects each file to be nearly identical and therefore uses the dictionaries in the same order for each file, building on top of the dictionary generated for the first file when it compresses the second. This would make compressing, for example, full raw images of hard drives far more efficient, because only the data that differs between two versions is stored on top of one full image: a base image plus an incremental image containing only the changes. The result could then be solid-compressed with LZMA or RAR and should in theory compress my 3 files into less than the size of a single one.
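
To make that concrete, this is the kind of base-plus-delta scheme I'm picturing (just a sketch with made-up file names and an arbitrary block size, and it assumes all images are the same size, as raw images of identical drives would be; I wouldn't trust it with real data as-is):

    # Sketch of the base + incremental idea: store only the blocks of
    # image2.img that differ from the base image1.img (placeholder names).
    BLOCK = 1024 * 1024  # compare in 1 MiB blocks

    def make_delta(base_path, other_path, delta_path):
        with open(base_path, "rb") as base, \
             open(other_path, "rb") as other, \
             open(delta_path, "wb") as delta:
            offset = 0
            while True:
                b = base.read(BLOCK)
                o = other.read(BLOCK)
                if not o:                       # end of the other file
                    break
                if o != b:                      # this block differs
                    # record: 8-byte offset, 4-byte length, then the raw block
                    delta.write(offset.to_bytes(8, "little"))
                    delta.write(len(o).to_bytes(4, "little"))
                    delta.write(o)
                offset += len(o)

    make_delta("image1.img", "image2.img", "image2.delta")
    make_delta("image1.img", "image3.img", "image3.delta")

The base image plus the two small delta files could then be handed to 7zip or WinRAR as usual and should end up close to the size of one compressed image.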

Can this be done using existing software, on any operating system? I don't like having to reinvent the wheel for a basic task like this. I could make my own archive format, but I don't trust myself to make it consistent and reliable, and I really can't risk losing any data in the process. Is it possible with some settings in 7zip, WinRAR or any other well-known software? I'd also be fine with a simple manual script for this task: first compare all the files, write the differences to a separate file together with the exact locations of each difference relative to the base image, then archive that along with all the metadata known about each file (all the timestamps in 100ns format, all file attributes, the original name and path). But I need to be 100% sure I can get the exact data that went into the archive back out.
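
The restore direction is the part I'd need to be absolutely sure about. I imagine something along these lines, with a hash check against the original before anything gets deleted (again only a sketch, same placeholder names and delta format as above):

    # Sketch of the reverse direction: rebuild image2.img from the base
    # image plus its delta, then verify the result byte-for-byte via a hash
    # before the original is deleted.
    import hashlib
    import shutil

    def apply_delta(base_path, delta_path, out_path):
        shutil.copyfile(base_path, out_path)          # start from the base image
        with open(delta_path, "rb") as delta, open(out_path, "r+b") as out:
            while True:
                header = delta.read(12)               # 8-byte offset + 4-byte length
                if not header:
                    break
                offset = int.from_bytes(header[:8], "little")
                length = int.from_bytes(header[8:12], "little")
                out.seek(offset)
                out.write(delta.read(length))         # overwrite the changed block

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    apply_delta("image1.img", "image2.delta", "image2.restored.img")
    assert sha256_of("image2.restored.img") == sha256_of("image2.img")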

