[Translation] A better zip bomb


This article shows how to construct a non-recursive zip bomb that achieves a high compression ratio by overlapping files inside the zip container. "Non-recursive" means that it does not rely on a decompressor recursively unpacking zip files nested inside zip files: it expands fully after a single round of decompression. The output size grows quadratically in the input size, reaching a compression ratio of over 28 million (10 MB → 281 TB) within the limits of the zip format. Even greater expansion is possible using 64-bit extensions. The construction uses only the most common compression algorithm, DEFLATE, and is compatible with most zip parsers.


Source Code:
 git clone https://www.bamsoftware.com/git/zipbomb.git 
zipbomb-20190702.zip

Data and source illustrations:
 git clone https://www.bamsoftware.com/git/zipbomb-paper.git 

                        archive size   uncompressed size   ratio         uncompressed size   ratio
                                       (non-recursive)                   (recursive)
Cox quine               440            440                 1.0           ∞
Ellingsen quine         28809          42569               1.5           ∞
42.zip * **             42374          558432              13.2          4507981343026016    106 billion
this technique          42374          5461307620          129 thousand  5461307620          129 thousand
this technique          9893525        281395456244934     28 million    281395456244934     28 million
this technique (Zip64)  45876952       4507981427706459    98 million    4507981427706459    98 million

* There are two versions of 42.zip: an older one of 42,374 bytes and a newer one of 42,838 bytes. The difference is that the newer one requires a password before unpacking. We compare only against the old version. Here is a copy of the file if you need one: 42.zip.

** I would like to credit the author of 42.zip, but I could not find out who it is; let me know if you have any information.


Zip bombs must overcome the fact that DEFLATE, the compression algorithm most commonly supported by parsers, cannot exceed a compression ratio of 1032 to 1. For this reason, zip bombs usually rely on recursive decompression, nesting zip files inside zip files to gain an extra factor of 1032 with each layer. But the trick only works in implementations that unpack recursively, and most do not. The best-known bomb, 42.zip, expands to a formidable 4.5 PB if all six of its layers are recursively unpacked, but it is a trifling 0.6 MB at the top layer. Zip quines, like those of Cox and Ellingsen, produce a copy of themselves and thus expand infinitely under recursive unpacking. But they, too, are completely safe when unpacked once.

This article shows how to create a non-recursive zip bomb whose compression ratio exceeds the DEFLATE limit of 1032. It works by overlapping files inside the zip container, referring to a "kernel" of highly compressed data from several files without making multiple copies of it. The zip bomb's output size grows quadratically in the input size; that is, the compression ratio improves as the bomb gets bigger. The construction depends on features of zip and DEFLATE: it does not transfer directly to other file formats or compression algorithms. It is compatible with most zip parsers, except "streaming" parsers, which parse a file in a single pass without consulting the zip file's central directory. We try to balance two conflicting goals:

  • Maximize the compression ratio. We define the compression ratio as the sum of the sizes of all files in the archive, divided by the size of the zip file itself. It does not count file names or other file system metadata, only the contents.
  • Maintain compatibility. Zip is a complex format, and parsers differ, especially in edge cases and with optional features. We will not use techniques that only work with certain parsers. We will point out some ways to improve the efficiency of a zip bomb at a certain cost in compatibility.

Zip file structure


A zip file consists of a central directory that refers to files.



The central directory is at the end of the zip file. It is a list of central directory headers. Each central directory header contains metadata for a single file, such as the file name and CRC-32 checksum, as well as a backwards pointer to a local file header. A central directory header is 46 bytes long, plus the length of the file name.

A file consists of a local file header followed by compressed file data. The local file header is 30 bytes long, plus the length of the file name. It contains a redundant copy of the metadata from the central directory header, as well as the compressed and uncompressed sizes of the file data that follows it. Zip is a container format, not a compression algorithm. Each file's data is compressed using the algorithm specified in its metadata, usually DEFLATE.

This description of the zip format omits many details that are not needed to understand the zip bomb. For full information, see Section 4.3 of APPNOTE.TXT or "The structure of a PKZip file" by Florian Buchholz, or consult the source code.

The considerable redundancy and the many ambiguities of the zip format open up possibilities for mischief of all kinds. The zip bomb is only the tip of the iceberg. Links for further reading:


The first insight: overlapping files


By compressing a long string of repeated bytes, we can produce a kernel of highly compressed data. By itself, the kernel's compression ratio cannot exceed the DEFLATE limit of 1032, so we want a way to reuse the kernel in many files without making a separate copy of it in each file. We can do that by overlapping files: making many central directory headers point to a single file, whose data is the kernel.



Consider an example of how this construction affects the compression ratio. Suppose a 1000-byte kernel decompresses to 1 MB. Then the first megabyte of output "costs" 1078 bytes of input:

  • 31 bytes for a local file header (including a 1-byte file name)
  • 47 bytes for a central directory header (including a 1-byte file name)
  • 1000 bytes for the kernel

But every 1 MB of output after the first costs only 47 bytes: we don't need another local file header or another copy of the kernel, only an additional central directory header. So while the first reference to the kernel has a compression ratio of 1,000,000 / 1078 ≈ 928, each additional reference moves the ratio closer to 1,000,000 / 47 ≈ 21,277, and a larger kernel raises the ceiling.
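As a sanity check, the arithmetic above can be reproduced in a few lines (a sketch; the constants are just the sizes from this example, not part of any implementation):

```python
# Sizes from the example: 31-byte local file header, 47-byte central
# directory header (1-byte file names), 1000-byte kernel -> 1 MB of output.
LFH, CDH, KERNEL, OUTPUT = 31, 47, 1000, 1_000_000

def ratio(n):
    # n central directory headers all point at one local file header + kernel.
    return n * OUTPUT / (LFH + KERNEL + n * CDH)

assert round(ratio(1)) == 928          # the first reference to the kernel
assert 20_000 < ratio(1000) < 21_277   # approaching 1_000_000 / 47
```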

The problem with this idea is compatibility. Because many central directory headers point to a single local file header, the metadata (in particular, the file name) cannot match for every file. Some parsers object to this. For example, Info-ZIP UnZip (the standard unzip program on Unix) extracts the files, but with warnings:

 $ unzip overlap.zip
   inflating: A
 B:  mismatching "local" filename (A),
          continuing with "central" filename version
   inflating: B
 ...

And Python's zipfile module throws an exception:

 $ python3 -m zipfile -e overlap.zip .
 Traceback (most recent call last):
 ...
 __main__.BadZipFile: File name in directory 'B' and header b'A' differ.

Next we will see how to modify the construction so that file names are consistent, while retaining most of the advantages of overlapping files.

The second insight: quoting local file headers


We need to separate the local file headers for each file, while still reusing a single kernel. Simply concatenating all the headers does not work, because the zip parser will find a local file header where it expects the beginning of a DEFLATE stream. But the idea will work, with a few changes. We will use a feature of DEFLATE, uncompressed blocks, to "quote" local file headers so that they appear to be part of the same DEFLATE stream that terminates in the kernel. Every local file header (except the first) will be interpreted in two ways: as code (part of the zip file structure) and as data (part of some file's contents).



A DEFLATE stream is a sequence of blocks, where each block may be compressed or uncompressed. We usually think only of compressed blocks; for example, the kernel is one big compressed block. But there are also uncompressed blocks, which start with a 5-byte header containing a length field, meaning simply: "output the next n bytes verbatim." Decompressing an uncompressed block amounts to removing the 5-byte header. Compressed and uncompressed blocks can be freely mixed in a DEFLATE stream. The output is the concatenation of the results of decompressing all the blocks in order. The term "uncompressed" only has meaning at the DEFLATE level; the file data still counts as "compressed" at the zip level, no matter what kinds of blocks are used.
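The mixing of stored and compressed blocks is easy to demonstrate with Python's zlib (a sketch: the 5-byte header is one byte of BFINAL/BTYPE flags followed by the length and its one's complement, per RFC 1951):

```python
import struct
import zlib

def stored_block(data, final=False):
    # 5-byte header: one byte holding BFINAL and BTYPE=00 (stored),
    # then LEN and its one's complement NLEN, both little-endian.
    flags = bytes([1 if final else 0])
    return flags + struct.pack("<HH", len(data), len(data) ^ 0xFFFF) + data

# A raw DEFLATE stream: a non-final stored block, then a compressed tail
# whose last block has BFINAL set.
c = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw DEFLATE, no zlib wrapper
compressed_tail = c.compress(b"a" * 1000) + c.flush()

stream = stored_block(b"quoted bytes") + compressed_tail
assert zlib.decompress(stream, -15) == b"quoted bytes" + b"a" * 1000
```

The stored block's contents pass through verbatim, which is exactly what lets a local file header ride inside another file's data stream.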

The easiest way to understand the construction is from the inside out, from the last file to the first. We start by inserting the kernel, which will form the end of the file data for every file. Prepend a local file header LFH_N and append a central directory header CDH_N that points to it. Set the "compressed size" metadata field in LFH_N and CDH_N to the compressed size of the kernel. Now prepend a 5-byte uncompressed block header (green in the diagram) whose length field equals the size of LFH_N. Prepend a second local file header LFH_{N−1} and append a central directory header CDH_{N−1} that points to it. Set the "compressed size" field of both new headers to the compressed size of the kernel, plus the size of the uncompressed block header (5 bytes), plus the size of LFH_N.

At this point the zip file contains two files, named Y and Z. Let's walk through what a parser sees when parsing it. Suppose the compressed size of the kernel is 1000 bytes and the size of LFH_N is 31 bytes. We start at CDH_{N−1} and follow the pointer to LFH_{N−1}. The first file's name is Y, and the compressed size of its file data is 1036 bytes. Interpreting the next 1036 bytes as a DEFLATE stream, we first encounter the 5-byte header of an uncompressed block that says to copy the next 31 bytes. We write the next 31 bytes, which are LFH_N, into the output of file Y. Moving on in the DEFLATE stream, we find a compressed block (the kernel), which we decompress into file Y. Now we have reached the end of the compressed data and are done with file Y. Proceeding to the next file, we follow the pointer from CDH_N to LFH_N and find a file named Z whose compressed size is 1000 bytes. Interpreting those 1000 bytes as a DEFLATE stream, we immediately encounter a compressed block (the kernel again) and decompress it into file Z. Now we have reached the end of the final file and are done. The output file Z contains the decompressed kernel; the output file Y is the same, but additionally prefixed with the 31 bytes of LFH_N.
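The two-file walk-through above can be reproduced end to end in Python (a sketch for illustration: header layouts follow APPNOTE.TXT, the kernel here is a small raw-DEFLATE stream rather than bulk_deflate output, and recent versions of Python's zipfile may reject such overlapped entries outright, so we inflate the two streams by hand with zlib, just as the walk-through does):

```python
import struct
import zlib

def deflate(data):
    c = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw DEFLATE, no wrapper
    return c.compress(data) + c.flush()

def local_file_header(name, crc, csize, usize):
    return struct.pack("<4sHHHHHIIIHH", b"PK\x03\x04", 20, 0, 8, 0, 0,
                       crc, csize, usize, len(name), 0) + name

def central_directory_header(name, crc, csize, usize, offset):
    return struct.pack("<4sHHHHHHIIIHHHHHII", b"PK\x01\x02", 20, 20, 0, 8, 0, 0,
                       crc, csize, usize, len(name), 0, 0, 0, 0, 0, offset) + name

plain = b"a" * 100_000
kernel = deflate(plain)

# Innermost file Z: a local file header, then the compressed kernel.
lfh_z = local_file_header(b"Z", zlib.crc32(plain), len(kernel), len(plain))

# File Y quotes LFH_Z in a 5-byte uncompressed DEFLATE block; its stream is
# the stored block (containing LFH_Z) followed by the shared kernel.
quote = bytes([0]) + struct.pack("<HH", len(lfh_z), len(lfh_z) ^ 0xFFFF)
y_plain = lfh_z + plain
y_csize = len(quote) + len(lfh_z) + len(kernel)
lfh_y = local_file_header(b"Y", zlib.crc32(y_plain), y_csize, len(y_plain))

body = lfh_y + quote + lfh_z + kernel
cd = (central_directory_header(b"Y", zlib.crc32(y_plain), y_csize,
                               len(y_plain), 0) +
      central_directory_header(b"Z", zlib.crc32(plain), len(kernel),
                               len(plain), len(lfh_y) + len(quote)))
eocd = struct.pack("<4sHHHHIIH", b"PK\x05\x06", 0, 0, 2, 2,
                   len(cd), len(body), 0)
archive = body + cd + eocd

# Walk the archive the way the parser in the text does.
y_stream = archive[len(lfh_y):len(lfh_y) + y_csize]
z_stream = archive[len(lfh_y) + len(quote) + len(lfh_z):][:len(kernel)]
assert zlib.decompress(y_stream, -15) == y_plain   # LFH_Z prefixed to the kernel
assert zlib.decompress(z_stream, -15) == plain
```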

We complete the construction by repeating the quoting procedure until the zip archive contains the desired number of files. Each new file adds a central directory header, a local file header, and an uncompressed block to quote the immediately following local file header. Each file's compressed data is generally a chain of uncompressed DEFLATE blocks (the quoted local file headers) followed by the compressed kernel. Each byte in the kernel contributes about 1032·N bytes to the output size, because every byte is part of all N files. The output files are not all the same size: earlier ones are larger than later ones, because they quote more local file headers. The contents of the output files are not particularly meaningful, but nobody said they had to make sense.

This quoted-overlap construction has better compatibility than the full-overlap construction of the previous section, but the compatibility comes at the cost of compression ratio. There, each added file cost only a central directory header; here it costs a central directory header, a local file header, and another 5 bytes for the quoting header.

Optimization


Now that we have the basic zip bomb construction, we will try to make it as efficient as possible. We want to answer two questions:

  • What is the maximum compression ratio for a given zip file size?
  • What is the maximum compression ratio, given the limitations of the zip format?

Compressing the kernel


It pays to compress the kernel as much as possible, since every decompressed byte gets multiplied by N. For this purpose we use a custom DEFLATE compressor called bulk_deflate, specialized in compressing strings of repeated bytes.

All decent DEFLATE compressors approach a compression ratio of 1032 on an infinite stream of repeated bytes, but we care about specific finite sizes. For our archive sizes, bulk_deflate fits in more data than the general-purpose compressors: about 26 kB more than zlib and Info-ZIP, and about 15 kB more than Zopfli, which trades speed for compression quality.



The price of bulk_deflate's high compression ratio is a lack of generality. It can compress only strings of repeated bytes, and only of certain lengths, namely 517 + 258·k for integer k ≥ 0. Besides compressing well, bulk_deflate is fast, doing essentially constant work regardless of the input size, aside from the work of actually writing out the compressed string.

Filenames


For our purposes, file names are mostly dead weight. While they do contribute to the output size by being part of the quoted local file headers, a byte in a file name contributes far less than a byte in the kernel. We want file names to be as short as possible, while remaining distinct, and keeping compatibility in mind.

Every byte spent on a file name is two bytes not spent on the kernel (two, because each file name appears twice: in the central directory header and in the local file header). A file name byte results in only (N + 1)/4 bytes of output on average, whereas a byte in the kernel counts for 1032·N. Examples: 1, 2, 3.

The first compatibility consideration is character encoding. The zip format specification says that file names are to be interpreted as CP 437, or as UTF-8 if a certain flag bit is set (APPNOTE.TXT, Appendix D). This is a major point of incompatibility among zip parsers, which may interpret file names in some fixed or locale-dependent encoding. So for compatibility it is best to restrict ourselves to characters that have the same encoding in both CP 437 and UTF-8; namely, the 95 printable US-ASCII characters.

We are also bound by file system naming restrictions. Some file systems are case-insensitive, so 'a' and 'A' do not count as distinct names. Common file systems like FAT32 prohibit certain characters, such as '*' and '?'.

As a safe but not necessarily optimal compromise, our zip bomb will use file names drawn from a 36-character alphabet that does not include special characters or letters differing only in case:

 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 

File names are generated in the obvious way, cycling through all positions in turn and appending a position when a cycle completes:

 "0", "1", "2", ..., "Z",
 "00", "01", "02", ..., "0Z",
 ...,
 "Z0", "Z1", "Z2", ..., "ZZ",
 "000", "001", "002", ... 

There are 36 one-character file names, 36² two-character names, and so on. Four bytes are enough for 1,727,604 distinct file names.
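The naming scheme can be sketched with a generator (an illustration; the real generator lives in the zipbomb source):

```python
import itertools

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def filenames():
    # Yield names in order of increasing length: "0".."Z", "00".."ZZ", "000"...
    for length in itertools.count(1):
        for chars in itertools.product(ALPHABET, repeat=length):
            yield "".join(chars)

names = list(itertools.islice(filenames(), 38))
assert names[0] == "0" and names[35] == "Z"
assert names[36] == "00" and names[37] == "01"
# Names of up to four characters cover 1,727,604 files.
assert sum(36 ** k for k in (1, 2, 3, 4)) == 1_727_604
```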

Given that file names in the archive will generally have different lengths, which way is it better to order them: shortest to longest, or the reverse? A little thought shows that it is better to put the longest names last. That ordering adds over 900 MB of output to zblg.zip, compared with ordering longest-first. It is a minor optimization, though: 900 MB is only 0.0003% of the total output size.

Kernel size


The quoted-overlap construction lets us place a kernel of compressed data and then cheaply copy it many times. For a given zip file size X, how much space is it optimal to devote to storing the kernel, and how much to making copies?

To find the optimal balance, we need only optimize a single variable, N, the number of files in the zip archive. Each value of N requires a certain amount of overhead for central directory headers, local file headers, quoting block headers, and file names. All the remaining space can be occupied by the kernel. Since N must be an integer, and you can only fit so many files before the kernel size drops to zero, it suffices to test every possible value of N and select the one that yields the most output.

Applying the optimization procedure to X = 42,374, the size of 42.zip, finds the maximum at N = 250. Those 250 files require 21,195 bytes of overhead, leaving 21,179 bytes for the kernel. A kernel of that size decompresses to 21,841,249 bytes (a ratio of 1031.3 to 1). The 250 copies of the decompressed kernel, plus a little extra from the quoted local file headers, make a total decompressed output of 5,461,307,620 bytes and a compression ratio of 129,000.

 zipbomb --mode=quoted_overlap --num-files=250 --compressed-size=21179 > zbsm.zip

The optimization produced an almost even split of space between the kernel and the file headers. That is no coincidence. Consider a simplified model of the quoted-overlap construction. In the simplified model we ignore file names, as well as the slight increase in output file size from quoting local file headers. Analysis of the simplified model shows that the optimal split between kernel and file headers is approximately even, and that the output size under an optimal split grows quadratically with the input size.

Definitions of some constants and variables:

X: zip file size (treated as fixed)
N: number of files in the zip archive (the variable being optimized)
CDH = 46: size of a central directory header (not counting the file name)
LFH = 30: size of a local file header (not counting the file name)
Q = 5: size of an uncompressed DEFLATE block header
C ≈ 1032: kernel compression ratio

Let H(N) be the header overhead for N files. See the diagram to understand where the formula comes from.

$H(N) = N \cdot (CDH + LFH) + (N - 1) \cdot Q$


That leaves X − H(N) space for the kernel. The total decompressed size S_X(N) is the size of N copies of the kernel, decompressed at ratio C (in this simplified model we ignore the slight extra expansion from the quoted local file headers).

$$S_X(N) = (X - H(N)) \cdot C \cdot N \\ = (X - (N \cdot (CDH + LFH) + (N - 1) \cdot Q)) \cdot C \cdot N \\ = -(CDH + LFH + Q) \cdot C \cdot N^2 + (X + Q) \cdot C \cdot N$$


S_X(N) is a quadratic polynomial in N, so the maximum must be where the derivative S′_X(N) is zero. Taking the derivative and solving for zero gives us N_OPT, the optimal number of files.

$$S'_X(N) = -2 (CDH + LFH + Q) \cdot C \cdot N + (X + Q) \cdot C \\ 0 = -2 (CDH + LFH + Q) \cdot C \cdot N_{OPT} + (X + Q) \cdot C \\ N_{OPT} = \frac{X + Q}{2 (CDH + LFH + Q)}$$


H(N_OPT) gives the optimal amount of space to devote to file headers. It is independent of CDH, LFH, and C, and is close to X/2.

$$H(N_{OPT}) = N_{OPT} \cdot (CDH + LFH) + (N_{OPT} - 1) \cdot Q = \frac{X - Q}{2}$$


S_X(N_OPT) is the total decompressed size under the optimal split. From it we see that the output size grows quadratically in the input size.

$S_X(N_{OPT}) = \frac{(X + Q)^2 \cdot C}{4 (CDH + LFH + Q)}$
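The simplified model is easy to check by brute force (a sketch; because the model ignores file names, it predicts an optimum near N ≈ 262 for X = 42,374 rather than the N = 250 found by the real optimizer):

```python
CDH, LFH, Q, C = 46, 30, 5, 1032  # constants of the simplified model

def H(n):
    # Header overhead for n files.
    return n * (CDH + LFH) + (n - 1) * Q

def S(x, n):
    # Total decompressed size: n copies of the kernel at ratio C.
    kernel = x - H(n)
    return kernel * C * n if kernel > 0 else 0

def best_n(x):
    # Brute force over every feasible file count.
    return max(range(1, x // (CDH + LFH + Q) + 2), key=lambda n: S(x, n))

x = 42374  # size of 42.zip
n = best_n(x)
n_formula = (x + Q) / (2 * (CDH + LFH + Q))
assert abs(n - n_formula) <= 1                     # matches the closed form
assert abs(H(n) - (x - Q) / 2) <= CDH + LFH + Q    # headers take about half
```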


As we increase the size of the zip file, eventually we run into the limits of the zip format: an archive can hold at most 2¹⁶−1 files, each at most 2³²−1 bytes. Worse, some implementations treat the maximum values as indicating the presence of 64-bit extensions, so our limits are really 2¹⁶−2 and 2³²−2. It happens that the limit we run into first is the one on uncompressed file size. At a zip file size of 8,319,377 bytes, naive optimization would give us a file count of 47,837 and a largest file of 2³²+311 bytes.

(In reality things are a bit more complicated, because the exact limits depend on the implementation. Python's zipfile ignores the file count; Go's archive/zip allows larger file counts as long as the low 16 bits agree. But for broad compatibility we must stay within the established limits.)

If we cannot increase N or the kernel size without bound, we would like to find the maximum compression ratio achievable within the zip format. We want to make the kernel as large as possible, with the maximum number of files. Even though we can no longer maintain a roughly even split between kernel and file headers, each added file still increases the compression ratio, just not as quickly as it would if we could keep growing the kernel too. In fact, as we add files we will need to shrink the kernel to make room for the maximum file size, which grows slightly with every file added.

The plan yields a zip archive of 2¹⁶−2 files and a kernel that decompresses to 2³²−2,178,825 bytes. The files get bigger towards the beginning of the zip archive; the first and biggest file decompresses to 2³²−56 bytes. That is as close as we can get using the coarse output sizes of bulk_deflate: encoding the final 54 bytes would cost more than it is worth. (The zip file overall has a compression ratio of 28 million, and those final 54 bytes would gain at most 54 ⋅ 1032 ⋅ (2¹⁶−2) ≈ 3.65 billion bytes of output, so it only helps if the 54 bytes can be encoded into a single byte of zip file, and I could not do it in fewer than two. Therefore, if you cannot encode the 54 bytes into 1 byte, encoding them only lowers the compression ratio.) The output of this zip bomb is 281,395,456,244,934 bytes, 99.97% of the theoretical maximum of (2³²−1) × (2¹⁶−1). Any further significant gains in compression ratio can only come from shrinking the input, not from growing the output.

 zipbomb --mode=quoted_overlap --num-files=65534 --max-uncompressed-size=4292788525 > zblg.zip

Efficient CRC-32 calculation


Among the metadata in the central directory header and the local file header is a CRC-32 checksum of the uncompressed file data. This poses a problem, because the amount of CRC-32 computation is proportional to the size of each file, which by design is very big (it is a zip bomb, after all). We would prefer to do work that is, at worst, proportional to the size of the archive. Two factors work in our favor: all the files share a common kernel, and the uncompressed kernel is a string of repeated bytes. We will represent CRC-32 as a matrix product; that lets us not only compute the kernel's checksum quickly, but also reuse computation across files. The technique described in this section is a small extension of the crc32_combine function in zlib, which Mark Adler explains here.

CRC-32 can be modeled as a state machine that updates a 32-bit state register for each input bit. The basic update operations for a 0 bit and a 1 bit are:

  uint32 crc32_update_0(uint32 state) {
      // Shift out the least significant bit.
      bit b = state & 1;
      state = state >> 1;
      // If the shifted-out bit was 1, XOR with the CRC-32 constant.
      if (b == 1)
          state = state ^ 0xedb88320;
      return state;
  }

  uint32 crc32_update_1(uint32 state) {
      // Do as for a 0 bit, then XOR with the CRC-32 constant.
      return crc32_update_0(state) ^ 0xedb88320;
  }

If you treat the state register as a 32-element binary vector, and use XOR for both addition and multiplication, then crc32_update_0 is a linear map; that is, it can be represented as multiplication by a 32×32 binary transition matrix. To see why, note that multiplying a matrix by a vector is just summing the matrix's columns, after multiplying each column by the corresponding element of the vector. The shift operation state >> 1 simply takes each bit i of the state vector and multiplies it by a vector that is zero everywhere except at bit i − 1 (numbering the bits from right to left). The final XOR, state ^ 0xedb88320, happens only when bit b is 1. That can be represented as first multiplying b by 0xedb88320 and then XORing the result into the state.

Furthermore, crc32_update_1 is just crc32_update_0 followed by an XOR with a constant. That makes crc32_update_1 an affine transformation: a matrix multiplication followed by a translation (i.e., a vector addition). We can represent the matrix multiplication and the translation in a single step by enlarging the transformation matrix to 33×33 and appending an extra element to the state vector that is always 1 (this representation is called homogeneous coordinates).


The 33×33 transformation matrices M₀ and M₁ compute the CRC-32 state change effected by a 0 bit and a 1 bit, respectively. Column vectors are stored with the most significant bit at the bottom: reading the first column from bottom to top, you see the CRC-32 polynomial constant edb88320₁₆ = 11101101101110001000001100100000₂. The two matrices differ only in their final column, which represents the translation vector in homogeneous coordinates. In M₀ the translation is zero, and in M₁ it is edb88320₁₆, the CRC-32 polynomial constant. The ones just above the diagonal represent the operation state >> 1.

Both the crc32_update_0 and crc32_update_1 operations can be represented by a 33×33 transition matrix; the matrices M₀ and M₁ are shown above. The advantage of the matrix representation is that matrices can be multiplied. Suppose we want the state change effected by processing the ASCII character 'a', whose binary representation is 01100001₂. We can represent the cumulative CRC-32 state change of those eight bits by a single transformation matrix:

$M_a = M_0 M_1 M_1 M_0 M_0 M_0 M_0 M_1$


And we can represent the state change effected by a string of repeated 'a's by multiplying many copies of M_a together: raising the matrix to a power. We can do that quickly using the fast exponentiation algorithm, which computes Mⁿ in only log₂ n steps. For example, here is the matrix for the state change of a string of 9 'a' characters:

$(M_a)^9 = M_a M_a M_a M_a M_a M_a M_a M_a M_a \\ = (M_a M_a M_a M_a)^2 M_a \\ = ((M_a M_a)^2)^2 M_a \\ = (((M_a)^2)^2)^2 M_a$


The fast exponentiation algorithm is useful for computing M_kernel, the matrix for the uncompressed kernel, since the kernel is a string of repeated bytes. To extract a CRC-32 checksum from a matrix, multiply the matrix by the zero vector (the zero vector in homogeneous coordinates, that is: 32 zeros followed by a 1; here we omit the minor complication of pre- and post-conditioning the checksum). To compute the checksum for every file, we work backwards. Start by initializing M := M_kernel. The kernel's checksum is also the checksum of the final file N, so multiply M by the zero vector and store the resulting checksum in CDH_N and LFH_N. The file data of file N−1 is the same as that of file N, but prefixed with LFH_N. So compute M_{LFH_N}, the state-change matrix for LFH_N, and update M := M · M_{LFH_N}. Now M represents the cumulative state change from processing LFH_N followed by the kernel. Compute the checksum of file N−1 by again multiplying M by the zero vector. Continue the procedure, accumulating state-change matrices into M, until all the files have been processed.
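The whole technique can be sketched in Python (an illustration, not the article's implementation: matrices are stored as lists of 33 column bitmasks, and the standard pre- and post-conditioning XORs are applied explicitly so the result can be checked against zlib):

```python
import zlib

POLY = 0xEDB88320  # the CRC-32 polynomial constant

# 33x33 binary matrices over GF(2), stored as 33 column vectors, each a
# 33-bit integer. Bit 32 is the homogeneous coordinate, always 1.
def identity():
    return [1 << j for j in range(33)]

def mat_vec(m, v):
    # XOR together the columns selected by the set bits of v.
    r = 0
    for j in range(33):
        if (v >> j) & 1:
            r ^= m[j]
    return r

def mat_mul(a, b):
    return [mat_vec(a, col) for col in b]

def mat_pow(m, n):
    # Fast exponentiation: log2(n) squarings.
    r = identity()
    while n:
        if n & 1:
            r = mat_mul(m, r)
        m = mat_mul(m, m)
        n >>= 1
    return r

def bit_matrix(bit):
    # M0 / M1 from the text: shift right, conditionally XOR the polynomial.
    m = [0] * 33
    m[0] = POLY                  # a shifted-out 1 bit feeds the polynomial in
    for j in range(1, 32):
        m[j] = 1 << (j - 1)      # state >> 1
    m[32] = (POLY if bit else 0) | (1 << 32)  # translation column
    return m

def byte_matrix(c):
    # Process the byte's bits LSB-first; later bits multiply on the left.
    m = identity()
    for i in range(8):
        m = mat_mul(bit_matrix((c >> i) & 1), m)
    return m

def crc32_repeated(c, n):
    # CRC-32 of the byte c repeated n times, in O(log n) matrix products.
    m = mat_pow(byte_matrix(c), n)
    v = 0xFFFFFFFF | (1 << 32)   # pre-conditioned state, homogeneous 1
    return (mat_vec(m, v) & 0xFFFFFFFF) ^ 0xFFFFFFFF

assert crc32_repeated(ord("a"), 100000) == zlib.crc32(b"a" * 100000)
```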

Extension: Zip64


Earlier, we ran into a scaling problem due to the limits of the zip format: it was impossible to produce more than about 281 TB of output, no matter how cleverly the zip file was packed. It is possible to surpass those limits using Zip64, an extension of the zip format that increases the size of certain header fields to 64 bits. Support for Zip64 is by no means universal, but it is one of the more commonly implemented extensions. As it affects the compression ratio, Zip64 increases the size of a central directory header from 46 to 58 bytes, and the size of a local file header from 30 to 50 bytes. Looking at the formula for optimal expansion in the simplified model, we see that a zip bomb in Zip64 format still grows quadratically, but more slowly because of the larger denominator; this is visible in the chart below. In exchange for the loss of compatibility and the slower growth, we get the removal of virtually all practical limits on file size.

Suppose we want a zip bomb that expands to 4.5 PB, the same as 42.zip. How big must the archive be? Using binary search, we find that the smallest such file is 46 MB.

  • zbxl.zip 46 MB → 4.5 PB (Zip64, less compatible)
 zipbomb --mode=quoted_overlap --num-files=190023 --compressed-size=22982788 --zip64 > zbxl.zip

4.5 petabytes is about as much data as the Event Horizon Telescope recorded for the first image of a black hole: racks upon racks of hard drives in a data center.

With Zip64, it is almost uninteresting to consider the maximum compression ratio, because we can just keep growing the zip file size, and the compression ratio along with it, until even the compressed zip file becomes prohibitively large. An interesting threshold, though, is 2⁶⁴ bytes (18 EB or 16 EiB): that much data will not fit on most file systems. Binary search finds the smallest zip bomb that produces at least that much output: it contains 12 million files and a compressed kernel of 1.5 GB. The total zip file size is 2.9 GB, and it decompresses to 2⁶⁴+11,727,895,877 bytes, a compression ratio of over 6.2 billion to one. I did not make this file available for download, but you can generate it yourself from the source code. It has files so large that they even exposed a bug in Info-ZIP UnZip 6.0.

 zipbomb --mode=quoted_overlap --num-files=12056313 --compressed-size=1482284040 --zip64 > zbxxl.zip

Extension: bzip2


DEFLATE is the most common compression algorithm used in the zip format, but it is only one of many options. Probably the second most common is bzip2, though it is nowhere near as compatible as DEFLATE. Theoretically, bzip2's maximum compression ratio is about 1.4 million to one, allowing for an even denser packing of the kernel.

bzip2 first applies run-length encoding, shrinking a string of repeated bytes by a factor of 51. The data is then split into 900 kB blocks, and each block is compressed separately. Theoretically, one block can compress down to 32 bytes: 900,000 × 51/32 = 1,434,375.
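bzip2's headroom over DEFLATE on repeated bytes is easy to observe with the standard library (a sketch; 45 MB is roughly 51 × 900 kB, i.e. about one bzip2 block after run-length encoding, and the exact ratios depend on the library versions):

```python
import bz2
import zlib

# Roughly one 900 kB bzip2 block after the 51x run-length encoding stage.
data = b"a" * 45_000_000

deflate_ratio = len(data) / len(zlib.compress(data, 9))
bzip2_ratio = len(data) / len(bz2.compress(data, 9))

assert deflate_ratio < 1032                # DEFLATE's hard ceiling
assert bzip2_ratio > 100 * deflate_ratio   # bzip2 does far better on this input
```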

Ignoring the loss of compatibility, does bzip2 allow a more efficient bomb to be made?

Yes, but only for small files. The problem is that bzip2 has nothing like DEFLATE's uncompressed blocks, which we used to quote local file headers. So it is impossible to overlap files and reuse the kernel: every file needs its own copy, and the overall compression ratio can be no better than the ratio of any single file. In the chart below we see that, without overlap, bzip2 surpasses DEFLATE only for files of about a megabyte and up.

The only hope is an alternative means of quoting headers in bzip2, which is discussed in the next section. Also, if you happen to know that a particular zip parser supports bzip2 and tolerates mismatched file names, you can use the full-overlap construction, which needs no quoting at all.


Comparison of the compression ratios of different zip bomb constructions. Note the logarithmic axes. Each construction is shown with and without Zip64. The constructions without overlap have a linear growth rate, visible in the 1:1 slope of their plots. The vertical offset of the bzip2 plot shows that bzip2's compression ratio is roughly a thousand times greater than DEFLATE's. The DEFLATE quoting constructions have a quadratic growth rate, visible in their 2:1 slope. The Zip64 variants are slightly less efficient, but allow output beyond 281 TB. The plots for bzip2 with extra-field quoting change from quadratic to linear upon reaching either the maximum file size (2³²−2 bytes) or the maximum permitted number of files.

Extension: quoting through the extra field


So far we have used a feature of DEFLATE to quote local file headers, and we have just seen that the trick does not work in bzip2. There is an alternative means of quoting, somewhat more limited, that uses only features of the zip format and does not depend on the compression algorithm.

At the end of the local file header structure there is a variable-length extra field for storing information that does not fit in the fixed header fields (APPNOTE.TXT, section 4.3.7). The extra information may include, for example, a timestamp or a Unix uid/gid; Zip64 information is also stored in the extra field. The extra field is a length-value structure; if we increase the length without adding a value, the extra field swallows whatever follows it in the zip file, namely the next local file header. Using this method, each local file header can "quote" the following headers by wrapping them in its own extra field. Compared to DEFLATE quoting, there are three advantages:

  1. Quoting through the extra field requires only 4 bytes, not 5, leaving more room for the kernel.
  2. It does not increase the size of the files themselves, which allows a larger kernel, given the limits of the zip format.
  3. It works with bzip2.
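To make the 4-byte figure concrete, here is a sketch of such an extra-field record (the tag value 0x9999 is a hypothetical, unassigned value chosen purely for illustration):

```python
import struct

# An extra-field record (APPNOTE.TXT 4.5.2) is a 2-byte type tag plus a
# 2-byte data length. By declaring a length with no accompanying data,
# the record "swallows" whatever bytes follow it in the file - here, a
# stand-in for the next local file header. Tag 0x9999 is an arbitrary
# unassigned value used for this illustration.
next_header = b"PK\x03\x04" + b"\x00" * 26  # fake 30-byte local file header
quote = struct.pack("<HH", 0x9999, len(next_header))

assert len(quote) == 4  # 4 bytes of overhead, vs 5 for a stored DEFLATE block

# A parser honoring the declared length skips right over the next header.
declared = struct.unpack("<HH", quote)[1]
assert declared == len(next_header)
```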

Despite these advantages, quoting through the extra field is less flexible. It does not chain as DEFLATE quoting does: each local file header must contain not just the next header but all subsequent headers, so the extra fields grow as you approach the beginning of the zip file. Because the maximum field length is 2¹⁶−1 bytes, it is only possible to quote up to 1808 local file headers (or 1170 with Zip64), assuming filenames are assigned as expected. (With DEFLATE, you can use the extra field to quote the first, shortest, local file headers and then switch to DEFLATE quoting for the rest.) Another problem: to conform to the internal data structure of the extra field, you must choose a 16-bit type tag (APPNOTE.TXT, section 4.5.2) to precede the quoted data. We want a type tag that makes parsers ignore the quoted data rather than try to interpret it as meaningful metadata. Zip parsers should ignore tags of unknown type, so we could choose tags at random, but there is a risk that some future tag assignment will break the compatibility of the construction.
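The 1808-header figure can be reproduced with a rough model of my own, not the article's exact accounting: assume the first header's extra field must cover every subsequent header, and that each quoted header costs its 30 fixed bytes plus a 4-byte extra-field tag plus a filename drawn in order from a base-36 alphabet ("0", "1", ..., "z", "00", ...).

```python
# Rough model (an assumption, not the article's code) of how many local
# file headers fit inside one 65535-byte extra field. Each quoted header
# costs 30 fixed bytes + a 4-byte extra-field tag + its filename.
def name_len(i: int) -> int:
    # Length of the i-th filename (1-indexed) over a 36-character alphabet:
    # 36 one-character names, then 36^2 two-character names, and so on.
    n, total = 1, 36
    while i > total:
        n += 1
        total += 36 ** n
    return n

BUDGET = 0xFFFF  # maximum extra-field length, 2**16 - 1
count, cost = 0, 0
while True:
    # The next header to quote is header number count + 2
    # (header 1 is the one doing the quoting).
    nxt = 30 + 4 + name_len(count + 2)
    if cost + nxt > BUDGET:
        break
    cost += nxt
    count += 1

print(count)  # lands on the article's figure of 1808 under these assumptions
```

Under these (assumed) costs the model matches the article's 1808 exactly; with the extra 20-byte Zip64 record per header, the same budget runs out near the article's 1170.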

The previous diagram illustrates the use of the extra field in bzip2, with and without Zip64. Both curves have a break point where growth changes from quadratic to linear. Without Zip64, this happens where the maximum uncompressed file size (2³²−2 bytes) is reached; beyond that, only the number of files can grow, not their size. The curve ends entirely when the number of files reaches 1809, at which point we run out of room in the extra field for quoting further headers. With Zip64, the break occurs at 1171 files, after which only the size of the files can grow, not their number. The extra field also helps in the DEFLATE case, but the difference is so small that it is not visually noticeable: it increases the compression ratio of zbsm.zip by 1.2%, zblg.zip by 0.019%, and zbxl.zip by 0.0025%.

Discussion


In their work on this topic, Pletz and colleagues use overlapping files to create an almost self-replicating zip archive. Overlapping files were suggested earlier (slide 47) by Gynvael Coldwind.

We designed the quoted-overlap zip bomb with compatibility in mind, accounting for a number of implementation differences, some of which are shown in the table below. The resulting construction is compatible with zip parsers that work in the usual way: first consulting the central directory and using it as an index of files. Among them is the unusual zip parser in Nail, which is automatically generated from a formal grammar. The construction is incompatible, however, with "streaming" parsers, which analyze the zip file from beginning to end in one pass without first reading the central directory. By their nature, streaming parsers do not permit file overlap of any kind. Most likely they will extract only the first file; they may even raise an error, as sunzip does, which parses the central directory at the end and checks it for consistency with the local file headers it has already seen.

If you need the extracted files to start with a specific prefix, different from the bytes of a local file header, you can insert a DEFLATE block before the uncompressed block that quotes the next header. Not every file in the archive has to participate in the bomb: you can include ordinary files in the archive if necessary, for instance to conform to some special format (the source code has a --template option for this use case). Many formats use zip as a container: Java JAR, Android APK, and LibreOffice documents, for example.

PDF resembles zip in some ways. It has a cross-reference table at the end of the file that points to earlier objects, and it supports object compression through the FlateDecode filter. I haven't tried it, but the quoted-overlap idea might be usable for a PDF bomb too. Perhaps it isn't even necessary to work that hard: binaryhax0r writes in a blog post that you can simply specify several layers of FlateDecode on a single object, after which creating a PDF bomb becomes trivial.

Detecting the specific class of zip bomb described in this article is easy: just look for overlapping files. Mark Adler wrote a patch for Info-ZIP UnZip that does exactly that. In general, however, rejecting overlapping files does not protect against all classes of zip bombs. It is difficult to predict in advance whether a file is a zip bomb unless you have precise knowledge of the internals of the parsers that will be used to analyze it. Peeking into the headers and summing the "uncompressed size" fields of all files does not work, because the values in the headers may not match the actual uncompressed sizes (see the "permits too-small file size" row in the compatibility table). Robust protection against zip bombs involves limits on time, memory, and disk space imposed on the zip parser while it runs. Treat the parsing of zip files, like any complex operation on untrusted data, with caution.
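As a sketch of the kind of external limit meant here (a hypothetical defensive wrapper of my own, not code from the article), an extractor can enforce a hard budget on total decompressed output instead of trusting any size field in the headers:

```python
import io
import zipfile

def safe_extract_bytes(zip_bytes: bytes, max_total: int = 10_000_000,
                       chunk: int = 65536) -> dict:
    # Read each member in small chunks and count actual decompressed
    # bytes, aborting when a hard budget is exceeded. Header size
    # fields are never trusted.
    out, total = {}, 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            buf = io.BytesIO()
            with zf.open(info) as member:
                while True:
                    piece = member.read(chunk)
                    if not piece:
                        break
                    total += len(piece)
                    if total > max_total:
                        raise ValueError("decompression budget exceeded")
                    buf.write(piece)
            out[info.filename] = buf.getvalue()
    return out
```

A real deployment would also bound wall-clock time and file count, but even this single limit defuses the quadratic blowup regardless of how the archive is constructed.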


A table listing several zip parsers, along with various features, edge cases, and zip bomb constructions. For best compatibility, use DEFLATE compression without Zip64, match the names in the central directory headers and the local file headers, compute correct CRC values, and avoid the maximum values of 32-bit and 16-bit fields.

Acknowledgments


Thanks to Mark Adler, Russ Cox, Brandon Enright, Marek Majkowski, Josh Wolf, and the USENIX WOOT 2019 reviewers for comments on a draft of this article. Caolán McNamara evaluated the security impact of the zip bombs on LibreOffice.

A version of this article was prepared for the USENIX WOOT 2019 workshop. The source code is available. Artifacts for the workshop presentation are in the file zipbomb-woot19.zip.

Did you find a system that chokes on one of these bombs? Did the bombs help you find a bug, or earn a reward in a bug bounty program? Let me know, and I will try to mention it here.

LibreOffice 6.1.5.2


After zblg.zip is renamed to zblg.odt or zblg.docx, LibreOffice creates and deletes a series of temporary files of about 4 GB each while trying to determine the file format. It eventually finishes, and it deletes the temporary files as it goes, so the zip bomb causes only a temporary DoS without filling the disk. Caolán McNamara responded to my bug report.

Mozilla addons-server 2019.06.06


I tried the zip bombs against a local installation of addons-server, which is part of the software behind addons.mozilla.org. The system handles the bomb gracefully, imposing a time limit of 110 seconds on file extraction. The zip bomb expands as fast as the disk allows up to that limit, but then the process is killed and the unpacked files are eventually cleaned up automatically.

UnZip 6.0


Mark Adler wrote a patch for UnZip to detect this class of zip bomb.

July 5, 2019: I noticed that CVE-2019-13232 has been assigned to UnZip. Personally, I would dispute that UnZip's (or any zip parser's) ability or inability to handle this kind of zip bomb necessarily represents a vulnerability, or even a bug. It is a natural implementation that does not violate the specification in any way. The type of bomb in this article is just one type; there are many other ways to confound a zip parser. As mentioned above, if you want to protect against resource-exhaustion attacks, you should not try to enumerate, detect, and block every known attack; rather, you should impose external limits on time and other resources so that the parser cannot get into trouble regardless of which attack it faces. There is nothing wrong with trying to detect and reject certain constructions as a first-pass optimization, but you cannot stop there. If you do not ultimately isolate and limit operations on untrusted data, your system is probably still vulnerable. Consider the analogy of cross-site scripting in HTML: the right defense is not to try to filter out specific bytes that might be interpreted as code, but to escape everything correctly.

Antivirus engines


Twitter user @TVqQAAMAAAAEAAA reports: "McAfee antivirus on my test machine just exploded." I have not confirmed this myself, and I do not have details such as the version number.

Tavis Ormandy points out that VirusTotal shows a series of timeouts for zblg.zip (screenshot of June 6, 2019): AhnLab-V3, ClamAV, DrWeb, Endgame, F-Secure, GData, K7AntiVirus, K7GW, MaxSecure, McAfee, McAfee-GW-Edition, Panda, Qihoo-360, Sophos ML, VBA32. The results for zbsm.zip (screenshot of June 6, 2019) are similar, but with a different set of timed-out engines: Baidu, Bkav, ClamAV, CMC, DrWeb, Endgame, ESET-NOD32, F-Secure, GData, Kingsoft, McAfee-GW-Edition, NANO-Antivirus, Acronis. Interestingly, the results for zbxl.zip contain no timeouts (screenshot of July 6, 2019). Perhaps some antiviruses do not support Zip64? Several engines detect the files as some kind of "compression bomb". It would be interesting to see whether they still do so after small changes, such as changing the filenames or adding an ASCII prefix to each file.

Final Statement


It's time to put an end to Facebook. Working there is not ethically neutral: every day you come to work, you do something wrong. If you have a Facebook account, delete it. If you work at Facebook, quit.

And do not forget that the National Security Agency should be destroyed.
