PyLCG
Linear Congruential Generator for IP Sharding
PyLCG is a Python implementation of a memory-efficient IP address sharding system using Linear Congruential Generators (LCG) for deterministic random number generation. This tool aids in distributed scanning & network reconnaissance by efficiently dividing IP ranges across multiple machines while visiting the addresses in a pseudo-random order.
Overview
When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:
- You want to ensure each machine works on a different part of the network (no overlap)
- You want to avoid scanning IPs in sequence (which can trigger security alerts)
- You need a way to resume scans if a machine fails
- You can't load millions of IPs into memory at once
PyLCG addresses these challenges by combining LCG-driven pseudo-random ordering with interleaved sharding & lazy, chunked IP generation.
How It Works
Understanding IP Addresses
First, let's understand how IP addresses work in our system:
- An IP address like 192.168.1.1 is really just a 32-bit number, equal to 3232235777 in decimal or 0xC0A80101 in hexadecimal
- A CIDR range like 192.168.0.0/16 represents a continuous range of these numbers
  - For example, 192.168.0.0/16 includes all IPs from 192.168.0.0 to 192.168.255.255 (65,536 addresses)
  - As 32-bit numbers, the range runs from 0xC0A80000 to 0xC0A8FFFF in hexadecimal, which is from 3232235520 to 3232301055 in decimal
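The conversions above can be checked with Python's standard ipaddress module. This is a minimal sketch; the helper names ip_to_int and cidr_bounds are hypothetical and are not taken from pylcg.py.

```python
import ipaddress
from typing import Tuple


def ip_to_int(ip: str) -> int:
    """Return the 32-bit integer behind a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(ip))


def cidr_bounds(cidr: str) -> Tuple[int, int]:
    """Return the first and last address of a CIDR range as integers."""
    network = ipaddress.ip_network(cidr, strict=False)
    return int(network.network_address), int(network.broadcast_address)


print(ip_to_int("192.168.1.1"))       # 3232235777
print(hex(ip_to_int("192.168.1.1")))  # 0xc0a80101
print(cidr_bounds("192.168.0.0/16"))  # (3232235520, 3232301055)
```

Treating a range as a pair of integers means any address in it can be rebuilt from the start value plus an index, without storing a list of addresses.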
The Magic of Linear Congruential Generators
At the heart of PyLCG is something called a Linear Congruential Generator (LCG). Think of it as a mathematical recipe that generates a sequence of numbers that appear random but are actually predictable if you know the starting point (seed).
Here's how it works:
- Start with a number (called the seed, which can be random)
- Multiply it by 1664525 & add 1013904223
- Take the remainder when divided by 2^32 (the modulo operation)
- Repeat the process to continue the sequence
Mathematical notation:
Next_Number = (1664525 * Current_Number + 1013904223) mod 2^32
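The recurrence can be written in a few lines of Python. The sketch below is illustrative only: the class name LCG and its next method are assumptions and may not match the interface in pylcg.py.

```python
class LCG:
    """Minimal 32-bit linear congruential generator (illustrative sketch)."""

    MULTIPLIER = 1664525
    INCREMENT = 1013904223
    MODULUS = 2 ** 32

    def __init__(self, seed: int) -> None:
        # Reduce the seed so the state always fits in 32 bits.
        self.state = seed % self.MODULUS

    def next(self) -> int:
        """Advance the state by one step and return the new value."""
        self.state = (self.MULTIPLIER * self.state + self.INCREMENT) % self.MODULUS
        return self.state


gen = LCG(seed=0)
print(gen.next())  # 1013904223, since 1664525 * 0 + 1013904223 is already below 2^32
```

Two generators created with the same seed produce the same sequence, which is what makes the scan order reproducible.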
Why these specific numbers?
The numbers 1664525 and 1013904223 are the multiplier and increment values used in a Linear Congruential Generator (LCG) for random number generation. This specific combination was featured in "Numerical Recipes in C" as a quick 32-bit generator. With a modulus of 2^32 it has a full period: the increment is odd and the multiplier minus one is divisible by 4, so the sequence visits every 32-bit value exactly once before repeating.
Sharding: Dividing the Work
PyLCG uses an interleaved sharding approach to ensure truly distributed scanning. Here's how it works:
- Interleaved Distribution: Instead of dividing the IP range into sequential blocks, PyLCG distributes IPs across shards using an offset pattern:
  - For 4 shards scanning a network:
    - Shard 0 handles IPs at indices: 0, 4, 8, 12, ...
    - Shard 1 handles IPs at indices: 1, 5, 9, 13, ...
    - Shard 2 handles IPs at indices: 2, 6, 10, 14, ...
    - Shard 3 handles IPs at indices: 3, 7, 11, 15, ...
- Randomization: Within each shard, the LCG randomizes the order of IPs:
  - Each index is fed through the LCG to generate a pseudo-random value
  - IPs are scanned in order of these values
  - The same seed ensures consistent ordering across runs
This approach ensures:
- Even distribution across the entire IP space
- No sequential scanning patterns that could trigger alerts
- Balanced distribution of work across shards (shard sizes differ by at most one IP)
- Deterministic results that can be reproduced
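The interleaving and per-shard ordering described above might be sketched as follows. This is not the code from pylcg.py: the names shard_ips, lcg_value and chunk_size are hypothetical, the way the seed and index are combined is an assumption, and a plain generator is used here for brevity where the project describes async generators.

```python
import ipaddress
from typing import Iterator, List

MULTIPLIER, INCREMENT, MODULUS = 1664525, 1013904223, 2 ** 32


def lcg_value(seed: int, index: int) -> int:
    """One LCG step applied to (seed + index); this mixing scheme is an assumption."""
    return (MULTIPLIER * ((seed + index) % MODULUS) + INCREMENT) % MODULUS


def _emit_chunk(indices: List[int], base: int, seed: int) -> Iterator[str]:
    """Yield the IPs for one chunk of indices, ordered by their LCG values."""
    for index in sorted(indices, key=lambda i: lcg_value(seed, i)):
        yield str(ipaddress.IPv4Address(base + index))


def shard_ips(cidr: str, shard_num: int, total_shards: int,
              seed: int, chunk_size: int = 1024) -> Iterator[str]:
    """Lazily yield the IPs of one shard (0-based), shuffled chunk by chunk."""
    network = ipaddress.ip_network(cidr, strict=False)
    base = int(network.network_address)
    chunk: List[int] = []
    # Interleaving: this shard owns indices shard_num, shard_num + total_shards, ...
    for index in range(shard_num, network.num_addresses, total_shards):
        chunk.append(index)
        if len(chunk) == chunk_size:
            yield from _emit_chunk(chunk, base, seed)
            chunk = []
    if chunk:
        yield from _emit_chunk(chunk, base, seed)


# Example: the first few IPs assigned to shard 1 of 4 for a /24 network.
for ip in list(shard_ips("192.168.0.0/24", 1, 4, seed=12345))[:5]:
    print(ip)
```

In this sketch the order is shuffled only within each chunk, so memory use is bounded by chunk_size rather than by the size of the range. Every machine that runs it with the same seed and shard number gets the same list.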
Memory-Efficient Processing
To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:
- Chunked Processing: Instead of loading all IPs at once, it processes them in chunks.
- Lazy Generation
  - IPs are generated only when needed using Python's async generators
  - The system yields one IP at a time rather than creating huge lists
  - This keeps memory usage constant regardless of IP range size
- Direct Calculation
  - The LCG can jump directly to any position in its sequence
  - No need to generate all previous numbers
  - Enables efficient random access to any part of the sequence
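One standard way to jump ahead in an LCG is to treat each step as an affine map and compose it with itself by repeated squaring, which takes a number of operations proportional to the bit length of the jump distance. The sketch below shows that general technique; the function name lcg_jump is hypothetical, and pylcg.py may compute positions differently.

```python
def lcg_jump(state: int, steps: int,
             a: int = 1664525, c: int = 1013904223, m: int = 2 ** 32) -> int:
    """Return the state reached after `steps` LCG iterations, without iterating."""
    acc_mult, acc_plus = 1, 0  # identity map: x -> 1 * x + 0
    cur_mult, cur_plus = a, c  # map for 2^k steps, starting with k = 0
    while steps > 0:
        if steps & 1:
            # Compose the accumulated map with the current power-of-two map.
            acc_mult = (acc_mult * cur_mult) % m
            acc_plus = (acc_plus * cur_mult + cur_plus) % m
        # Square the current map so it covers twice as many steps.
        cur_plus = ((cur_mult + 1) * cur_plus) % m
        cur_mult = (cur_mult * cur_mult) % m
        steps >>= 1
    return (acc_mult * state + acc_plus) % m


def lcg_step(state: int) -> int:
    """Single LCG step, used only to cross-check the jump."""
    return (1664525 * state + 1013904223) % 2 ** 32


# Cross-check: jumping 1000 steps matches stepping 1000 times.
state = 12345
for _ in range(1000):
    state = lcg_step(state)
print(state == lcg_jump(12345, 1000))  # True
```

A jump function like this is also what makes resuming practical: storing the seed and the number of values already consumed is enough to recover the generator state.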
Roadmap
- Add support for IPv6
- Add support for custom LCG parameters like adding port numbers
- Add support for custom chunk sizes & auto-tuning based on available system resources
- Add support for resuming from a specific point in the sequence
- Add support for saving the state of the LCG to a file so you can resume later
- Add support for sharding line-based input files locally, from an S3 bucket, or from a URL by reading them in chunks.
- Update the unit tests to include benchmarks & better coverage for future efficiency improvements & validation.