CVE-2025-66019: LZW Decompression DoS Vulnerability in pypdf Library
PDF files use various algorithms to compress their content. This compression reduces file size but also carries some security risks. While conducting security research, I discovered a DoS (Denial of Service) vulnerability in the pypdf library’s LZW (Lempel-Ziv-Welch) decompression implementation. In this post, I’ll first explain how the LZW algorithm works and why it leads to this vulnerability, then explain why other algorithms like ZLIB don’t suffer from this issue.
How Does LZW Decompression Algorithm Work?
LZW (Lempel-Ziv-Welch) is a compression algorithm developed in 1984. It’s widely used in PDFs because it can compress repeating patterns very efficiently.
Basic Principle: Dictionary-Based Compression
LZW works by using a dictionary (codebook). The algorithm adds previously seen patterns to the dictionary and represents them with code numbers when encountered again.
Simple example:
Let’s say we want to compress this text: "ABABABAB"
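A minimal, illustrative encoder shows what happens to "ABABABAB" (simplified relative to PDF's LZW, which also uses Clear/EOD codes and variable-width output):

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW encoder: emits a list of dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # single-byte entries
    w, codes = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                            # keep extending the match
        else:
            codes.append(dictionary[w])       # emit code for longest match
            dictionary[wc] = len(dictionary)  # learn the new pattern
            w = wc[-1:]
    if w:
        codes.append(dictionary[w])
    return codes

print(lzw_compress(b"ABABABAB"))  # [65, 66, 256, 258, 66]
```

Eight bytes shrink to five codes: 65 and 66 are the raw bytes "A" and "B", while 256 and 258 stand for the learned patterns "AB" and "ABA".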
During decompression, we decode these codes to recover the original text.
LZW Decompression Process
LZW decompression reconstructs the dictionary by reading compressed data:
1. Initialize the dictionary (codes 0-255 map to the 256 single-byte values)
2. Read the first code and write its entry to the output
3. For each subsequent code:
   a. Look up the code in the dictionary
   b. If found, write its entry to the output; if not, apply the special case (previous entry plus its own first byte)
   c. Build a new pattern (previous entry + first byte of the current entry)
   d. Add the new pattern to the dictionary
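The steps above translate into a minimal, illustrative decoder (again simplified: no Clear/EOD codes or variable code widths):

```python
def lzw_decompress(codes: list[int]) -> bytes:
    """Minimal LZW decoder: rebuilds the dictionary while decoding."""
    dictionary = {i: bytes([i]) for i in range(256)}   # step 1
    next_code = 256
    previous = dictionary[codes[0]]                    # step 2
    output = bytearray(previous)
    for code in codes[1:]:                             # step 3
        if code in dictionary:
            entry = dictionary[code]                   # 3a, 3b
        else:
            # Special case: the code refers to the entry being built
            entry = previous + previous[0:1]
        output += entry
        dictionary[next_code] = previous + entry[0:1]  # 3c, 3d
        next_code += 1
        previous = entry
    return bytes(output)

print(lzw_decompress([65, 66, 256, 258, 66]))  # b'ABABABAB'
```

The special-case branch handles the classic situation where the encoder emits a code one step before the decoder has finished defining it.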
LZW’s Strength: High Compression Ratios
LZW can achieve very high compression ratios when working with repeating patterns. For example:
Original data: "AAAAA...AAAAA" (1000 characters)
LZW compression: [65, 256, 257, 258, ...] (~50 codes)
Compression ratio: 20:1
This feature makes LZW ideal for PDFs because PDFs often contain repeating patterns (spaces, same colors, etc.).
Why Does LZW Decompression Lead to DoS Vulnerability?
LZW’s high compression capability also creates a security risk. Here’s why:
1. Dictionary Growth and Memory Usage
During LZW decompression, the dictionary continuously grows. Each new pattern is added to the dictionary and these patterns are held in memory.
Small compressed data: 5 MB
Dictionary size: Can grow (theoretically unlimited)
Decompressed data: Can reach up to 950 MB
The problem: pypdf holds all decompressed data in memory. A small PDF file can occupy a very large memory space after decompression.
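The blow-up is easy to quantify: a crafted code sequence in which every code references the entry the decoder is still building makes each decoded chunk one byte longer than the last, so k codes expand to roughly k·(k+1)/2 output bytes. A small sketch of this arithmetic (illustrative; it ignores the dictionary resets the PDF variant performs once the code table fills):

```python
# Worst-case growth within one LZW dictionary generation: the k-th code
# decodes to a k-byte entry, so output grows quadratically in code count.
def worst_case_output(num_codes: int) -> int:
    return num_codes * (num_codes + 1) // 2

for k in (100, 1_000, 3_800):    # ~3800 codes fit a 12-bit code table
    compressed = k * 12 // 8     # codes are at most 12 bits wide
    print(f"{compressed:>6} B compressed -> "
          f"{worst_case_output(k):>9} B decompressed")
```

At ~3800 codes that is roughly 5.7 KB of codes expanding to ~7.2 MB of output, a ratio over 1000:1; an attacker can chain many such segments, which is how a few megabytes of input can approach the 950 MB figure above.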
2. Exploiting Repeating Patterns
An attacker can turn LZW's strength into a weapon by crafting input that consists of highly repetitive patterns:
Attacker data:
"ABCABCABCABC..." (repeated 1 million times)
LZW compression:
1. "AB" → dictionary[256]
2. "BC" → dictionary[257]
3. "CA" → dictionary[258]
4. "ABC" → dictionary[259], "CAB" → dictionary[260], ...
5. Longer and longer runs of the pattern are emitted as single codes (very high compression ratio)
Result:
- Compressed: ~50 KB
- Decompressed: 3 MB
- Compression ratio: 60:1
With larger patterns, this ratio can increase even more.
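This amplification can be reproduced with a toy LZW encoder (a simplification: no Clear/EOD codes or variable-width packing; it assumes 12 bits, i.e. 1.5 bytes, per emitted code):

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW encoder: emits a list of dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    w, codes = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc
        else:
            codes.append(dictionary[w])
            dictionary[wc] = len(dictionary)  # next free code
            w = wc[-1:]
    if w:
        codes.append(dictionary[w])
    return codes

data = b"ABC" * 100_000                   # 300 KB of repeating pattern
codes = lzw_compress(data)
compressed_bytes = len(codes) * 12 // 8   # assume 12-bit codes
print(f"{len(data)} B -> ~{compressed_bytes} B "
      f"(~{len(data) // compressed_bytes}:1)")
```

The longer the input repeats, the longer the dictionary entries become, so the ratio keeps climbing with input size.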
3. pypdf’s Current Implementation
pypdf’s LZW decompression implementation works like this:
```python
# pypdf/_codecs/_codecs.py (simplified)
class LzwCodec(Codec):
    def decode(self, data: bytes) -> bytes:
        output_stream = io.BytesIO()
        output_length = 0

        # Create dictionary with the 256 single-byte entries
        dictionary = {i: bytes([i]) for i in range(256)}
        next_code = 258  # 256 and 257 are the Clear and EOD codes
        previous = b""

        # Decompression loop
        while True:
            code = read_next_code()
            if code in dictionary:
                decoded = dictionary[code]
            else:
                # Special case: code not in dictionary yet
                decoded = previous + previous[0:1]

            output_stream.write(decoded)
            output_length += len(decoded)

            # Limit check
            if output_length > self.max_output_length:
                raise LimitReachedError(...)

            # Add new pattern to dictionary
            if previous:
                dictionary[next_code] = previous + decoded[0:1]
                next_code += 1
            previous = decoded

        return output_stream.getvalue()  # All data in memory!
```
Critical issue: The output_stream.getvalue() call holds all decompressed data in memory. This causes significant memory consumption for large files.
4. Root Cause: Holding All Data in Memory
By nature, LZW decompression doesn’t require holding all output in memory. However, pypdf’s current implementation holds all data in a BytesIO buffer and returns it all at once with getvalue().
Why is this a problem?
- Streaming approach is possible: LZW decompression can be done in a streaming fashion (in chunks)
- Memory efficiency: For large files, streaming significantly reduces memory usage
- DoS risk: Holding all data in memory allows small PDFs to cause large memory consumption
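As a sketch of what such a streaming approach could look like (hypothetical, not pypdf's actual API; for simplicity it takes an already-parsed list of codes rather than a bit stream):

```python
from typing import Iterable, Iterator

def lzw_decompress_stream(codes: Iterable[int],
                          chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield decoded output in bounded chunks instead of one big buffer."""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = b""
    buffer = bytearray()
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        elif previous:
            entry = previous + previous[0:1]  # special case
        else:
            raise ValueError("invalid first code")
        buffer += entry
        if previous:
            dictionary[next_code] = previous + entry[0:1]
            next_code += 1
        previous = entry
        if len(buffer) >= chunk_size:
            yield bytes(buffer)  # caller can write to disk, count bytes, ...
            buffer.clear()
    if buffer:
        yield bytes(buffer)

# The caller can now enforce a limit incrementally:
total = 0
for chunk in lzw_decompress_stream([65, 66, 256, 258, 66], chunk_size=4):
    total += len(chunk)
print(total)  # 8
```

The dictionary itself still grows, but the decompressed payload no longer has to sit in one in-memory buffer, and output limits can be enforced per chunk.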
Why Doesn’t ZLIB Suffer from This Issue?
ZLIB (and DEFLATE) uses a different approach than LZW. This difference makes ZLIB more resilient to this type of DoS attack.
ZLIB/DEFLATE Algorithm
ZLIB uses the DEFLATE algorithm. DEFLATE is a two-stage compression method:
- LZ77 (Lempel-Ziv 77): Finds repeating patterns using a sliding window
- Huffman Coding: Encodes found patterns more efficiently
Key difference: ZLIB uses a sliding window, while LZW uses a growing dictionary.
Sliding Window vs Growing Dictionary
- LZW (growing dictionary): every decoded pattern is added to an ever-growing dictionary, and individual entries keep getting longer.
- ZLIB (sliding window): back-references can only point into the most recent 32 KB of output, so decoder state stays bounded.
ZLIB’s Resilience to DoS
1. Limited Window Size:
ZLIB's sliding window has a fixed size (typically 32 KB). This keeps the decoder's reference window from growing without bound.
LZW: dictionary → can grow without bound
ZLIB: window → fixed at 32 KB
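Python's standard zlib bindings make this bounded behavior directly usable: `decompressobj().decompress(data, max_length)` returns at most `max_length` bytes and leaves the remainder in `unconsumed_tail`, so a caller can reject oversized streams early. A minimal sketch (the 1 MiB default limit is an arbitrary choice):

```python
import zlib

def safe_zlib_decompress(data: bytes, limit: int = 1_048_576) -> bytes:
    """Decompress at most `limit` bytes; refuse streams that want more."""
    d = zlib.decompressobj()
    out = d.decompress(data, limit)  # never returns more than `limit`
    if d.unconsumed_tail:            # stream would expand past the cap
        raise ValueError("decompressed output exceeds limit")
    return out

bomb = zlib.compress(b"A" * 10_000_000)  # ~10 MB compresses to ~10 KB
try:
    safe_zlib_decompress(bomb, limit=1_048_576)
except ValueError as exc:
    print(exc)  # decompressed output exceeds limit
```

Because the check happens before the full payload is materialized, a zlib bomb costs the defender at most `limit` bytes of memory.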
2. Lower Compression Ratios:
ZLIB typically cannot match LZW's amplification in these attack scenarios, and DEFLATE's format caps the worst case: a single back-reference decodes to at most 258 bytes, bounding the overall ratio at roughly 1032:1:
Repeating pattern: "ABCABCABC..."
LZW compression ratio: 20-50:1
ZLIB compression ratio: 5-10:1
This makes it harder for attackers to create large memory consumption with small files.
3. pypdf’s ZLIB Limit:
pypdf uses a lower limit for ZLIB:
```python
# pypdf/filters.py
ZLIB_MAX_OUTPUT_LENGTH = 75_000_000    # 75 MB
LZW_MAX_OUTPUT_LENGTH = 1_000_000_000  # 1 GB (13.3x higher!)
```
This inconsistency makes LZW more risky.
Comparison Table
| Feature | LZW | ZLIB/DEFLATE |
|---|---|---|
| Dictionary/Window | Growing dictionary | Fixed sliding window |
| Maximum size | Theoretically unlimited | Fixed (32 KB) |
| Compression ratio (repeating pattern) | 20-50:1 | 5-10:1 |
| Memory usage (decompression) | High (all data) | Lower (streaming possible) |
| DoS risk | High | Lower |
| pypdf limit | 1 GB | 75 MB |
CVE-2025-66019: Vulnerability Details
Vulnerability Summary
CVE-2025-66019 is a DoS vulnerability found in the pypdf library’s LZW decompression implementation. The library had added a limit mechanism with a previous security fix (CVE-2025-62708), but the 1 GB limit value is still insufficient for practical security.
Key Findings:
- ✅ Limit mechanism exists (added in version 6.1.3)
- ⚠️ Limit value (1 GB) is too high for practical security
- ⚠️ Inconsistency with ZLIB limit (75 MB vs 1 GB)
- ✅ Proof of Concept demonstrates exploitability
Affected Components
- File: `pypdf/_codecs/_codecs.py` (`LzwCodec` class)
- File: `pypdf/filters.py` (`LZWDecode` class)
- Current limit: `LZW_MAX_OUTPUT_LENGTH = 1_000_000_000` (1 GB)
Patched Versions
This vulnerability has been fixed in pypdf >= 6.4.0. Affected users are advised to update the library.
Practical Attack Scenario
An attacker can:
- Create 950 MB decompressed data (repeating patterns)
- Compress with LZW to create a ~4-20 MB PDF file
- Upload to server
Server (pypdf):
- Reads small PDF (20 MB)
- Decompresses LZW stream
- Allocates 950 MB memory
- CPU-intensive processing (1-3 minutes)
- Memory exhaustion → DoS
Proof of Concept Results
Test PDF (100 MB decompressed):
- PDF file size: ~2-5 MB (compressed)
- Decompressed size: ~100 MB
- Processing time: 5-15 seconds
- Memory usage: +100-150 MB
Full Impact PDF (950 MB decompressed):
- PDF file size: ~4-20 MB (compressed)
- Decompressed size: ~950 MB
- Processing time: 1-3 minutes
- Memory usage: +950-1100 MB
Compression Ratio Demonstration
Demo results:
- Simple repeating data: 18.87:1 compression
- Pattern-based data: 19.74:1 compression
- Large compressible data: 219.28:1 compression
- Random data: 0.74:1 (LZW performs poorly)
Impact Assessment
Affected Environments
Serverless Functions (AWS Lambda, Cloud Functions):
- Memory limits: 128 MB - 10 GB
- Impact: Memory exhaustion, crashes, cost increase
API Endpoints (PDF processing services):
- Concurrent requests
- Impact: Resource depletion, service degradation
Embedded Systems:
- Limited RAM (256 MB - 1 GB)
- Impact: System crashes, denial of service
Web Applications:
- PDF upload/processing features
- Impact: Server slowdown, memory exhaustion
Conclusion
CVE-2025-66019 is a DoS vulnerability found in the pypdf library. The root cause of the vulnerability is LZW algorithm’s high compression capability and pypdf’s approach of holding all decompressed data in memory. Algorithms like ZLIB are less affected by this issue due to their sliding window approach. This vulnerability has been fixed in pypdf >= 6.4.0.
Related content:
- exifLooter: Extracting Hidden Location Information from Photos
- PassDetective: Detecting Passwords and Secrets in Your Shell History