CVE-2025-66019: LZW Decompression DoS Vulnerability in pypdf Library
PDF files use various algorithms to compress their content. This compression reduces file size but also carries some security risks. While conducting security research, I discovered a DoS (Denial of Service) vulnerability in the pypdf library’s LZW (Lempel-Ziv-Welch) decompression implementation. In this post, I’ll first explain how the LZW algorithm works and why it leads to this vulnerability, then explain why other algorithms like ZLIB don’t suffer from this issue.
How Does LZW Decompression Algorithm Work?
LZW (Lempel-Ziv-Welch) is a compression algorithm developed in 1984. It’s widely used in PDFs because it can compress repeating patterns very efficiently.
Basic Principle: Dictionary-Based Compression
LZW works by using a dictionary (codebook). The algorithm adds previously seen patterns to the dictionary and represents them with code numbers when encountered again.
Simple example:
Let’s say we want to compress this text: "ABABABAB"
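A minimal, illustrative encoder shows what happens to "ABABABAB" (simplified relative to PDF's LZW, which also uses Clear/EOD codes and variable-width output):

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW encoder: emits a list of dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # single-byte entries
    w, codes = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                            # keep extending the match
        else:
            codes.append(dictionary[w])       # emit code for longest match
            dictionary[wc] = len(dictionary)  # learn the new pattern
            w = wc[-1:]
    if w:
        codes.append(dictionary[w])
    return codes

print(lzw_compress(b"ABABABAB"))  # [65, 66, 256, 258, 66]
```

Eight bytes shrink to five codes: 65 and 66 are the raw bytes "A" and "B", while 256 and 258 stand for the learned patterns "AB" and "ABA".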
During decompression, we decode these codes to recover the original text.
LZW Decompression Process
LZW decompression reconstructs the dictionary by reading compressed data:
1. Initialize the dictionary (codes 0-255 map to the 256 single-byte values)
2. Read the first code and write its entry to the output
3. For each subsequent code:
   a. Look up the code in the dictionary
   b. If found, write its entry to the output; if not, apply the special case (previous entry plus its own first byte)
   c. Build a new pattern (previous entry + first byte of the current entry)
   d. Add the new pattern to the dictionary
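The steps above translate into a minimal, illustrative decoder (again simplified: no Clear/EOD codes or variable code widths):

```python
def lzw_decompress(codes: list[int]) -> bytes:
    """Minimal LZW decoder: rebuilds the dictionary while decoding."""
    dictionary = {i: bytes([i]) for i in range(256)}   # step 1
    next_code = 256
    previous = dictionary[codes[0]]                    # step 2
    output = bytearray(previous)
    for code in codes[1:]:                             # step 3
        if code in dictionary:
            entry = dictionary[code]                   # 3a, 3b
        else:
            # Special case: the code refers to the entry being built
            entry = previous + previous[0:1]
        output += entry
        dictionary[next_code] = previous + entry[0:1]  # 3c, 3d
        next_code += 1
        previous = entry
    return bytes(output)

print(lzw_decompress([65, 66, 256, 258, 66]))  # b'ABABABAB'
```

The special-case branch handles the classic situation where the encoder emits a code one step before the decoder has finished defining it.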
LZW’s Strength: High Compression Ratios
LZW can achieve very high compression ratios when working with repeating patterns. For example:
Original data: "AAAAA...AAAAA" (1000 characters)
LZW compression: [65, 256, 257, 258, ...] (~50 codes)
Compression ratio: 20:1
This feature makes LZW ideal for PDFs because PDFs often contain repeating patterns (spaces, same colors, etc.).
Why Does LZW Decompression Lead to DoS Vulnerability?
LZW’s high compression capability also creates a security risk. Here’s why:
1. Dictionary Growth and Memory Usage
During LZW decompression, the dictionary continuously grows. Each new pattern is added to the dictionary and these patterns are held in memory.
Small compressed data: 5 MB
Dictionary size: Can grow (theoretically unlimited)
Decompressed data: Can reach up to 950 MB
The problem: pypdf holds all decompressed data in memory. A small PDF file can occupy a very large memory space after decompression.
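The blow-up is easy to quantify: a crafted code sequence in which every code references the entry the decoder is still building makes each decoded chunk one byte longer than the last, so k codes expand to roughly k·(k+1)/2 output bytes. A small sketch of this arithmetic (illustrative; it ignores the dictionary resets the PDF variant performs once the code table fills):

```python
# Worst-case growth within one LZW dictionary generation: the k-th code
# decodes to a k-byte entry, so output grows quadratically in code count.
def worst_case_output(num_codes: int) -> int:
    return num_codes * (num_codes + 1) // 2

for k in (100, 1_000, 3_800):    # ~3800 codes fit a 12-bit code table
    compressed = k * 12 // 8     # codes are at most 12 bits wide
    print(f"{compressed:>6} B compressed -> "
          f"{worst_case_output(k):>9} B decompressed")
```

At ~3800 codes that is roughly 5.7 KB of codes expanding to ~7.2 MB of output, a ratio over 1000:1; an attacker can chain many such segments, which is how a few megabytes of input can approach the 950 MB figure above.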
2. Exploiting Repeating Patterns
An attacker can turn LZW's strength into a weapon by crafting input that consists of highly repetitive patterns:
Attacker data:
"ABCABCABCABC..." (repeated 1 million times)
LZW compression:
1. "AB" → dictionary[256]
2. "BC" → dictionary[257]
3. "CA" → dictionary[258]
4. "ABC" → dictionary[259], "CAB" → dictionary[260], ...
5. Longer and longer runs of the pattern are emitted as single codes (very high compression ratio)
Result:
- Compressed: ~50 KB
- Decompressed: 3 MB
- Compression ratio: 60:1
With larger patterns, this ratio can increase even more.
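This amplification can be reproduced with a toy LZW encoder (a simplification: no Clear/EOD codes or variable-width packing; it assumes 12 bits, i.e. 1.5 bytes, per emitted code):

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW encoder: emits a list of dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    w, codes = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc
        else:
            codes.append(dictionary[w])
            dictionary[wc] = len(dictionary)  # next free code
            w = wc[-1:]
    if w:
        codes.append(dictionary[w])
    return codes

data = b"ABC" * 100_000                   # 300 KB of repeating pattern
codes = lzw_compress(data)
compressed_bytes = len(codes) * 12 // 8   # assume 12-bit codes
print(f"{len(data)} B -> ~{compressed_bytes} B "
      f"(~{len(data) // compressed_bytes}:1)")
```

The longer the input repeats, the longer the dictionary entries become, so the ratio keeps climbing with input size.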
3. pypdf’s Current Implementation
pypdf’s LZW decompression implementation works like this:
```python
# pypdf/_codecs/_codecs.py (simplified)
class LzwCodec(Codec):
    def decode(self, data: bytes) -> bytes:
        output_stream = io.BytesIO()
        output_length = 0

        # Create dictionary with the 256 single-byte entries
        dictionary = {i: bytes([i]) for i in range(256)}
        next_code = 258  # 256 and 257 are the Clear and EOD codes
        previous = b""

        # Decompression loop
        while True:
            code = read_next_code()
            if code in dictionary:
                decoded = dictionary[code]
            else:
                # Special case: code not in dictionary yet
                decoded = previous + previous[0:1]

            output_stream.write(decoded)
            output_length += len(decoded)

            # Limit check
            if output_length > self.max_output_length:
                raise LimitReachedError(...)

            # Add new pattern to dictionary
            if previous:
                dictionary[next_code] = previous + decoded[0:1]
                next_code += 1
            previous = decoded

        return output_stream.getvalue()  # All data in memory!
```
Critical issue: The output_stream.getvalue() call holds all decompressed data in memory. This causes significant memory consumption for large files.
4. Root Cause: Holding All Data in Memory
By nature, LZW decompression doesn’t require holding all output in memory. However, pypdf’s current implementation holds all data in a BytesIO buffer and returns it all at once with getvalue().
Why is this a problem?
- Streaming approach is possible: LZW decompression can be done in a streaming fashion (in chunks)
- Memory efficiency: For large files, streaming significantly reduces memory usage
- DoS risk: Holding all data in memory allows small PDFs to cause large memory consumption
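As a sketch of what such a streaming approach could look like (hypothetical, not pypdf's actual API; for simplicity it takes an already-parsed list of codes rather than a bit stream):

```python
from typing import Iterable, Iterator

def lzw_decompress_stream(codes: Iterable[int],
                          chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield decoded output in bounded chunks instead of one big buffer."""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = b""
    buffer = bytearray()
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        elif previous:
            entry = previous + previous[0:1]  # special case
        else:
            raise ValueError("invalid first code")
        buffer += entry
        if previous:
            dictionary[next_code] = previous + entry[0:1]
            next_code += 1
        previous = entry
        if len(buffer) >= chunk_size:
            yield bytes(buffer)  # caller can write to disk, count bytes, ...
            buffer.clear()
    if buffer:
        yield bytes(buffer)

# The caller can now enforce a limit incrementally:
total = 0
for chunk in lzw_decompress_stream([65, 66, 256, 258, 66], chunk_size=4):
    total += len(chunk)
print(total)  # 8
```

The dictionary itself still grows, but the decompressed payload no longer has to sit in one in-memory buffer, and output limits can be enforced per chunk.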
Why Doesn’t ZLIB Suffer from This Issue?
ZLIB (and DEFLATE) uses a different approach than LZW. This difference makes ZLIB more resilient to this type of DoS attack.
ZLIB/DEFLATE Algorithm
ZLIB uses the DEFLATE algorithm. DEFLATE is a two-stage compression method:
- LZ77 (Lempel-Ziv 77): Finds repeating patterns using a sliding window
- Huffman Coding: Encodes found patterns more efficiently
Key difference: ZLIB uses a sliding window, while LZW uses a growing dictionary.
Sliding Window vs Growing Dictionary
- LZW (growing dictionary): every decoded pattern is added to an ever-growing dictionary, and individual entries keep getting longer.
- ZLIB (sliding window): back-references can only point into the most recent 32 KB of output, so decoder state stays bounded.
ZLIB’s Resilience to DoS
1. Limited Window Size:
ZLIB's sliding window has a fixed size (typically 32 KB). This keeps the decoder's reference window from growing without bound.
LZW: dictionary → can grow without bound
ZLIB: window → fixed at 32 KB
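Python's standard zlib bindings make this bounded behavior directly usable: `decompressobj().decompress(data, max_length)` returns at most `max_length` bytes and leaves the remainder in `unconsumed_tail`, so a caller can reject oversized streams early. A minimal sketch (the 1 MiB default limit is an arbitrary choice):

```python
import zlib

def safe_zlib_decompress(data: bytes, limit: int = 1_048_576) -> bytes:
    """Decompress at most `limit` bytes; refuse streams that want more."""
    d = zlib.decompressobj()
    out = d.decompress(data, limit)  # never returns more than `limit`
    if d.unconsumed_tail:            # stream would expand past the cap
        raise ValueError("decompressed output exceeds limit")
    return out

bomb = zlib.compress(b"A" * 10_000_000)  # ~10 MB compresses to ~10 KB
try:
    safe_zlib_decompress(bomb, limit=1_048_576)
except ValueError as exc:
    print(exc)  # decompressed output exceeds limit
```

Because the check happens before the full payload is materialized, a zlib bomb costs the defender at most `limit` bytes of memory.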
2. Lower Compression Ratios:
ZLIB typically cannot match LZW's amplification in these attack scenarios, and DEFLATE's format caps the worst case: a single back-reference decodes to at most 258 bytes, bounding the overall ratio at roughly 1032:1:
Repeating pattern: "ABCABCABC..."
LZW compression ratio: 20-50:1
ZLIB compression ratio: 5-10:1
This makes it harder for attackers to create large memory consumption with small files.
3. pypdf’s ZLIB Limit:
pypdf uses a lower limit for ZLIB:
```python
# pypdf/filters.py
ZLIB_MAX_OUTPUT_LENGTH = 75_000_000    # 75 MB
LZW_MAX_OUTPUT_LENGTH = 1_000_000_000  # 1 GB (13.3x higher!)
```
This inconsistency makes LZW more risky.
Comparison Table
| Feature | LZW | ZLIB/DEFLATE |
|---|---|---|
| Dictionary/Window | Growing dictionary | Fixed sliding window |
| Maximum size | Theoretically unlimited | Fixed (32 KB) |
| Compression ratio (repeating pattern) | 20-50:1 | 5-10:1 |
| Memory usage (decompression) | High (all data) | Lower (streaming possible) |
| DoS risk | High | Lower |
| pypdf limit | 1 GB | 75 MB |
CVE-2025-66019: Vulnerability Details
Vulnerability Summary
CVE-2025-66019 is a DoS vulnerability found in the pypdf library’s LZW decompression implementation. The library had added a limit mechanism with a previous security fix (CVE-2025-62708), but the 1 GB limit value is still insufficient for practical security.
Key Findings:
- ✅ Limit mechanism exists (added in version 6.1.3)
- ⚠️ Limit value (1 GB) is too high for practical security
- ⚠️ Inconsistency with ZLIB limit (75 MB vs 1 GB)
- ✅ Proof of Concept demonstrates exploitability
Affected Components
- File: `pypdf/_codecs/_codecs.py` (`LzwCodec` class)
- File: `pypdf/filters.py` (`LZWDecode` class)
- Current limit: `LZW_MAX_OUTPUT_LENGTH = 1_000_000_000` (1 GB)
Patched Versions
This vulnerability has been fixed in pypdf >= 6.4.0. Affected users are advised to update the library.
Practical Attack Scenario
An attacker can:
- Create 950 MB decompressed data (repeating patterns)
- Compress with LZW to create a ~4-20 MB PDF file
- Upload to server
Server (pypdf):
- Reads small PDF (20 MB)
- Decompresses LZW stream
- Allocates 950 MB memory
- CPU-intensive processing (1-3 minutes)
- Memory exhaustion → DoS
Proof of Concept Results
Test PDF (100 MB decompressed):
- PDF file size: ~2-5 MB (compressed)
- Decompressed size: ~100 MB
- Processing time: 5-15 seconds
- Memory usage: +100-150 MB
Full Impact PDF (950 MB decompressed):
- PDF file size: ~4-20 MB (compressed)
- Decompressed size: ~950 MB
- Processing time: 1-3 minutes
- Memory usage: +950-1100 MB
Compression Ratio Demonstration
Demo results:
- Simple repeating data: 18.87:1 compression
- Pattern-based data: 19.74:1 compression
- Large compressible data: 219.28:1 compression
- Random data: 0.74:1 (LZW performs poorly)
Impact Assessment
Affected Environments
Serverless Functions (AWS Lambda, Cloud Functions):
- Memory limits: 128 MB - 10 GB
- Impact: Memory exhaustion, crashes, cost increase
API Endpoints (PDF processing services):
- Concurrent requests
- Impact: Resource depletion, service degradation
Embedded Systems:
- Limited RAM (256 MB - 1 GB)
- Impact: System crashes, denial of service
Web Applications:
- PDF upload/processing features
- Impact: Server slowdown, memory exhaustion
Conclusion
CVE-2025-66019 is a DoS vulnerability found in the pypdf library. The root cause of the vulnerability is LZW algorithm’s high compression capability and pypdf’s approach of holding all decompressed data in memory. Algorithms like ZLIB are less affected by this issue due to their sliding window approach. This vulnerability has been fixed in pypdf >= 6.4.0.
Related content:
- exifLooter: Extracting Hidden Location Information from Photos
- PassDetective: Detecting Passwords and Secrets in Your Shell History