PyLCG
Linear Congruential Generator for IP Sharding
PyLCG is a Python implementation of a memory-efficient IP address sharding system using Linear Congruential Generators (LCG) for deterministic random number generation. This tool enables distributed scanning and network reconnaissance by efficiently dividing IP ranges across multiple machines.
Project Origins & Purpose
PyLCG was inspired by the elegant IP distribution system used in masscan, the popular mass IP port scanner. While masscan implements this logic as part of its larger codebase, I wanted to isolate and implement this specific component as a standalone Python library that developers can easily integrate into their own projects.
The goal was to create a clean, well-documented implementation that:
- Can be used as a drop-in solution for any project needing IP distribution capabilities
- Provides the same reliable mathematical foundation as masscan's approach
- Is easy to understand and modify for specific needs
- Works well with modern Python async patterns
By extracting this functionality into its own library, developers can add sophisticated IP distribution capabilities to their network tools without having to reinvent the wheel or extract code from larger projects.
Overview
When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:
- You want to ensure each machine works on a different part of the network (no overlap)
- You want to avoid scanning IPs in sequence (which can trigger security alerts)
- You need a way to resume scans if a machine fails
- You can't load millions of IPs into memory at once
PyLCG addresses these challenges by combining fixed shard boundaries, a deterministic LCG-based scan order, and lazy, chunked generation of addresses.
How It Works
Understanding IP Addresses
First, let's understand how IP addresses work in our system:
- An IP address like 192.168.1.1 is really just a 32-bit number
- A CIDR range like 192.168.0.0/16 represents a continuous range of these numbers
- For example, 192.168.0.0/16 includes all IPs from 192.168.0.0 to 192.168.255.255 (65,536 addresses)
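The mapping between addresses and integers can be checked with Python's standard `ipaddress` module. This is a minimal sketch; `index_to_ip` is a hypothetical helper introduced here for illustration, not part of PyLCG's documented API.

```python
import ipaddress

# An IPv4 address converts losslessly to and from a 32-bit integer.
addr = ipaddress.IPv4Address("192.168.1.1")
addr_int = int(addr)
round_trip = ipaddress.IPv4Address(addr_int)

# A CIDR block is a contiguous run of those integers.
net = ipaddress.ip_network("192.168.0.0/16")
first_int = int(net.network_address)
last_int = int(net.broadcast_address)


def index_to_ip(network, index):
    """Map a zero-based index within `network` to its address (hypothetical helper)."""
    if not 0 <= index < network.num_addresses:
        raise IndexError("index outside the network")
    return network.network_address + index


print(addr_int)                  # 3232235777
print(net.num_addresses)         # 65536
print(index_to_ip(net, 65535))   # 192.168.255.255
```

Working with indices rather than address strings is what lets the rest of the scheme treat a CIDR range as a plain range of integers.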
The Magic of Linear Congruential Generators
At the heart of PyLCG is something called a Linear Congruential Generator (LCG). Think of it as a mathematical recipe that generates a sequence of numbers that appear random but are actually predictable if you know the starting point (seed).
Here's how it works:
- Start with a number (called the seed)
- Multiply it by a carefully chosen constant (1597 in our case)
- Add another carefully chosen constant (51749)
- Take the remainder when divided by 2^32
- That's your next number! Repeat the process to get more numbers
In mathematical notation:
Next_Number = (1597 * Current_Number + 51749) mod 2^32
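The recurrence translates almost directly into Python. The sketch below uses the constants stated above; the names `lcg_next` and `lcg_stream` are illustrative and may not match the functions in pylcg.py.

```python
from itertools import islice

MULTIPLIER = 1597
INCREMENT = 51749
MODULUS = 2**32


def lcg_next(state):
    """Advance the generator by one step."""
    return (MULTIPLIER * state + INCREMENT) % MODULUS


def lcg_stream(seed):
    """Yield successive LCG outputs forever, starting from `seed`."""
    state = seed % MODULUS
    while True:
        state = lcg_next(state)
        yield state


first_three = list(islice(lcg_stream(0), 3))
print(first_three[0])  # 51749
print(first_three[1])  # 82694902
```

The same seed always produces the same stream, which is what makes the scan order reproducible across runs and machines.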
Why these specific numbers?
- 1597 and 51749 were chosen because they create a sequence that:
  - Visits every possible number before repeating (maximum period)
  - Spreads numbers evenly across the range
  - Can be calculated quickly on computers
- 2^32 (4,294,967,296) is used because it:
  - Matches the size of a 32-bit integer
  - Is large enough to handle any IP range
  - Makes calculations efficient on modern CPUs
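The maximum-period claim can be verified with the Hull-Dobell theorem. For a power-of-two modulus of at least 4, it reduces to two conditions: the increment must be odd, and the multiplier minus one must be divisible by 4. The sketch below checks those conditions and then confirms the result empirically on a smaller modulus (2^16), since walking all 2^32 states would be slow. The function names are illustrative.

```python
from math import gcd


def has_full_period_pow2(a, c, m):
    """Hull-Dobell conditions specialised to m = 2**k with k >= 2."""
    if m < 4 or m & (m - 1):
        raise ValueError("m must be a power of two and at least 4")
    return gcd(c, m) == 1 and (a - 1) % 4 == 0


def cycle_length(a, c, m, seed=0):
    """Count the distinct states visited before the sequence repeats."""
    seen = set()
    state = seed % m
    while state not in seen:
        seen.add(state)
        state = (a * state + c) % m
    return len(seen)


print(has_full_period_pow2(1597, 51749, 2**32))   # True
print(cycle_length(1597, 51749, 2**16) == 2**16)  # True
```

Here 51749 is odd and 1597 - 1 = 1596 = 4 × 399, so both conditions hold and every 32-bit value is visited exactly once per cycle.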
Sharding: Dividing the Work
Let's say you want to scan a /16 network (65,536 IPs) using 4 machines. Here's how PyLCG handles it:
- Division: First, it divides the total IPs evenly:
  - 65,536 ÷ 4 = 16,384 IPs per shard
  - Machine 1: IPs 0-16,383
  - Machine 2: IPs 16,384-32,767
  - Machine 3: IPs 32,768-49,151
  - Machine 4: IPs 49,152-65,535
- Randomization: Within each shard, IPs are randomized using the LCG:
  - Each IP index (0 to 65,535) is fed through the LCG
  - The resulting numbers determine the scan order
  - Because we use the same seed, this order is consistent across runs
Example of how IPs might be ordered in Shard 1 (the values are illustrative, not actual LCG output):
Original order: 0, 1, 2, 3...
LCG values: 51749, 134238, 297019, 12983...
Final order: 3, 0, 1, 2... (indices sorted by their LCG values)
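The two steps can be sketched as follows. The README does not spell out exactly how an index is paired with an LCG value, so this sketch assumes that index i is keyed by the (i+1)-th output of the stream started from the seed. The names `shard_bounds` and `shard_order` are hypothetical.

```python
from itertools import islice

A, C, M = 1597, 51749, 2**32


def lcg_stream(seed):
    """Yield successive LCG outputs starting from `seed`."""
    state = seed % M
    while True:
        state = (A * state + C) % M
        yield state


def shard_bounds(total, shard_num, total_shards):
    """Half-open [start, end) index range for a 1-based shard number.

    The last shard absorbs any remainder when the split is uneven.
    """
    size = total // total_shards
    start = (shard_num - 1) * size
    end = total if shard_num == total_shards else start + size
    return start, end


def shard_order(total, shard_num, total_shards, seed):
    """Return this shard's indices, sorted by their LCG keys (assumed keying)."""
    start, end = shard_bounds(total, shard_num, total_shards)
    keys = islice(lcg_stream(seed), start, end)
    return [index for _, index in sorted(zip(keys, range(start, end)))]


print(shard_bounds(65536, 2, 4))  # (16384, 32768)
order = shard_order(65536, 1, 4, seed=12345)
print(len(order))                 # 16384
```

Sorting holds one shard's keys in memory at a time, which is acceptable for a /16 but motivates the chunked approach for larger ranges.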
Memory-Efficient Processing
To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:
- Chunked Processing: Instead of loading all IPs at once, it processes them in chunks:
  # Example with chunk_size = 1000
  Chunk 1: Process IPs 0-999
  Chunk 2: Process IPs 1000-1999
  ...and so on
- Lazy Generation:
  - IPs are generated only when needed using Python's async generators
  - The system yields one IP at a time rather than creating huge lists
  - This keeps memory usage constant regardless of IP range size
- Direct Calculation:
  - The LCG can jump directly to any position in its sequence
  - No need to generate all previous numbers
  - Enables efficient random access to any part of the sequence
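These three techniques fit together in a short sketch. `lcg_jump` computes the state after n steps in O(log n) by repeatedly squaring the affine map x → a·x + c, and `iter_shard` is an async generator that holds only one chunk in memory at a time. Both names are hypothetical. As a simplification, this version shuffles addresses only within each chunk, which may differ from how PyLCG orders them.

```python
import asyncio
import ipaddress

A, C, M = 1597, 51749, 2**32


def lcg_jump(state, n):
    """Return the LCG state after `n` steps without iterating through them."""
    acc_a, acc_c = 1, 0    # identity map: x -> x
    cur_a, cur_c = A, C    # one step:     x -> A*x + C
    while n > 0:
        if n & 1:
            # Compose the accumulated map with the current power of the step map.
            acc_a, acc_c = (cur_a * acc_a) % M, (cur_a * acc_c + cur_c) % M
        # Square the current map: applying it twice gives a^2*x + (a + 1)*c.
        cur_a, cur_c = (cur_a * cur_a) % M, ((cur_a + 1) * cur_c) % M
        n >>= 1
    return (acc_a * state + acc_c) % M


async def iter_shard(cidr, start, end, seed, chunk_size=1000):
    """Lazily yield the IPs at indices [start, end) of `cidr`, chunk by chunk."""
    base = int(ipaddress.ip_network(cidr).network_address)
    for chunk_start in range(start, end, chunk_size):
        chunk_end = min(chunk_start + chunk_size, end)
        state = lcg_jump(seed, chunk_start)  # jump straight to this chunk
        keyed = []
        for index in range(chunk_start, chunk_end):
            state = (A * state + C) % M
            keyed.append((state, index))
        for _, index in sorted(keyed):       # shuffle within the chunk only
            yield str(ipaddress.IPv4Address(base + index))
        await asyncio.sleep(0)               # let other tasks run between chunks


async def main():
    return [ip async for ip in iter_shard("192.168.0.0/16", 0, 2500, seed=12345)]


ips = asyncio.run(main())
print(len(ips))  # 2500
```

Because any chunk can be reached with a single jump, a worker can start anywhere in its shard without replaying earlier chunks.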
Real-World Applications
Network Security Testing
Imagine you're testing the security of a large corporate network:
- You have 5 scanning machines
- You need to scan 1 million IPs
- You want to avoid triggering IDS/IPS systems
PyLCG helps by:
- Dividing the IPs evenly across your 5 machines
- Randomizing the scan order to avoid detection
- Allowing you to pause/resume scans from any point
- Using minimal memory on each machine
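Pause and resume follow from the order being deterministic: a seed and a position are enough to pick up where a scan stopped. The sketch below assumes a small JSON checkpoint file. The file name and the helper names are hypothetical, and the README does not describe how PyLCG itself persists state.

```python
import json
import os
import tempfile

A, C, M = 1597, 51749, 2**32


def lcg_outputs(seed, start, count):
    """Yield `count` LCG outputs beginning at zero-based position `start`."""
    state = seed % M
    for _ in range(start):  # fast-forward past completed work (linear for simplicity)
        state = (A * state + C) % M
    for _ in range(count):
        state = (A * state + C) % M
        yield state


def save_checkpoint(path, seed, position):
    """Persist the only two values needed to resume."""
    with open(path, "w") as fh:
        json.dump({"seed": seed, "position": position}, fh)


def load_checkpoint(path):
    with open(path) as fh:
        data = json.load(fh)
    return data["seed"], data["position"]


full_run = list(lcg_outputs(seed=42, start=0, count=10))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scan.checkpoint.json")
    before_pause = list(lcg_outputs(seed=42, start=0, count=4))
    save_checkpoint(path, seed=42, position=len(before_pause))
    seed, position = load_checkpoint(path)
    after_resume = list(lcg_outputs(seed, position, 10 - position))

print(before_pause + after_resume == full_run)  # True
```

The checkpoint stays a few bytes in size regardless of how many addresses have been scanned.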
Cloud-Based Scanning
In cloud environments, PyLCG is particularly useful:
- Easily scale up/down the number of scanning instances
- Each instance knows exactly which IPs to scan
- Consistent results across multiple runs
- Efficient resource usage keeps costs down