# PyLCG

> Linear Congruential Generator for IP Sharding

PyLCG is a Python implementation of a memory-efficient IP address sharding system using Linear Congruential Generators *(LCG)* for deterministic random number generation. This tool enables distributed scanning and network reconnaissance by efficiently dividing IP ranges across multiple machines.

___

## Table of Contents

- [Project Origins & Purpose](#project-origins-and-purpose)
- [Overview](#overview)
- [How It Works](#how-it-works)
    - [Understanding IP Addresses](#understanding-ip-addresses)
    - [The Magic of Linear Congruential Generators](#the-magic-of-linear-congruential-generators)
    - [Sharding: Dividing the Work](#sharding-dividing-the-work)
    - [Memory-Efficient Processing](#memory-efficient-processing)
- [Real-World Applications](#real-world-applications)
    - [Network Security Testing](#network-security-testing)
    - [Cloud-Based Scanning](#cloud-based-scanning)

___

## Project Origins & Purpose

PyLCG was inspired by the elegant IP distribution system used in [masscan](https://github.com/robertdavidgraham/masscan), the popular mass IP port scanner. While masscan implements this logic as part of its larger codebase, I wanted to isolate and implement this specific component as a standalone Python library that developers can easily integrate into their own projects.

The goal was to create a clean, well-documented implementation that:
- Can be used as a drop-in solution for any project needing IP distribution capabilities
- Provides the same reliable mathematical foundation as masscan's approach
- Is easy to understand and modify for specific needs
- Works well with modern Python async patterns

By extracting this functionality into its own library, developers can add sophisticated IP distribution capabilities to their network tools without having to reinvent the wheel or extract code from larger projects.

___

## Overview

When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:

1. You want to ensure each machine works on a different part of the network *(no overlap)*
2. You want to avoid scanning IPs in sequence *(which can trigger security alerts)*
3. You need a way to resume scans if a machine fails
4. You can't load millions of IPs into memory at once

PyLCG addresses these challenges by combining an LCG, which gives a deterministic pseudo-random ordering, with even sharding of the IP range and lazy, chunked generation of addresses.

___

## How It Works

### Understanding IP Addresses

First, let's understand how IP addresses work in our system:

- An IPv4 address like `192.168.1.1` is really just a 32-bit number
- A CIDR range like `192.168.0.0/16` represents a contiguous range of these numbers
- For example, `192.168.0.0/16` includes all IPs from `192.168.0.0` to `192.168.255.255` *(65,536 addresses)*
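
The address-as-number idea can be tried out with Python's standard `ipaddress` module. This is only a small sketch; `ip_at_index` is a hypothetical helper written for illustration and is not claimed to be part of PyLCG's API.

```python
import ipaddress

# An IPv4 address is a 32-bit integer under the hood.
ip = ipaddress.IPv4Address("192.168.1.1")
ip_as_int = int(ip)  # 3232235777
back_to_ip = ipaddress.IPv4Address(ip_as_int)

# A CIDR block is a contiguous run of those integers.
network = ipaddress.ip_network("192.168.0.0/16")
total = network.num_addresses  # 65536


def ip_at_index(net, index):
    """Return the address at a zero-based offset inside the network (hypothetical helper)."""
    if not 0 <= index < net.num_addresses:
        raise IndexError("index outside network")
    return ipaddress.IPv4Address(int(net.network_address) + index)
```

Because any address in a range can be derived from its offset, the rest of the system only needs to work with plain integers between `0` and `total - 1`.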

### The Magic of Linear Congruential Generators

At the heart of PyLCG is something called a Linear Congruential Generator *(LCG)*. Think of it as a mathematical recipe that generates a sequence of numbers that appear random but are actually predictable if you know the starting point *(seed)*.

Here's how it works:

1. Start with a number *(called the seed)*
2. Multiply it by a carefully chosen constant *(1597 in our case)*
3. Add another carefully chosen constant *(51749)*
4. Take the remainder when divided by 2^32
5. That's your next number! Repeat the process to get more numbers

In mathematical notation:
```
Next_Number = (1597 * Current_Number + 51749) mod 2^32
```
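
The recurrence translates almost line for line into Python. The class below is a minimal sketch that uses the constants quoted here; the class and method names are assumptions chosen for illustration and may not match the library's actual code.

```python
class LCG:
    """Minimal LCG sketch using the constants quoted in this README."""

    MULTIPLIER = 1597
    INCREMENT = 51749
    MODULUS = 2 ** 32

    def __init__(self, seed):
        # Keep the state inside the 32-bit range from the start.
        self.current = seed % self.MODULUS

    def next(self):
        """Advance one step and return the new value."""
        self.current = (self.MULTIPLIER * self.current + self.INCREMENT) % self.MODULUS
        return self.current


gen = LCG(seed=0)
first_values = [gen.next() for _ in range(3)]  # starts 51749, 82694902, ...
```

Running it twice with the same seed produces the same values, which is the property that makes scan orders reproducible.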

Why these specific numbers?

- `1597` and `51749` were chosen because they create a sequence that:
    - Visits every possible number before repeating *(maximum period)*. With a power-of-two modulus this requires an odd increment and a multiplier that is one more than a multiple of 4, and both constants qualify
    - Spreads numbers evenly across the range, since each value appears exactly once per period
    - Can be calculated quickly on computers
- `2^32` *(4,294,967,296)* is used because it:
    - Matches the size of a 32-bit integer
    - Is large enough to handle any IPv4 range
    - Makes calculations efficient on modern CPUs, since the remainder reduces to keeping the low 32 bits
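
The maximum-period claim can be checked against the Hull-Dobell theorem, which for a power-of-two modulus reduces to two conditions: the increment is odd and the multiplier leaves remainder 1 when divided by 4. The sketch below tests those conditions and, as a sanity check, brute-forces the cycle length on a smaller modulus *(2^16, chosen here only so the loop finishes instantly)*. Both function names are hypothetical.

```python
def has_full_period_pow2(multiplier, increment):
    """Hull-Dobell conditions specialised to a power-of-two modulus of at least 4."""
    return increment % 2 == 1 and multiplier % 4 == 1


def cycle_length(multiplier, increment, modulus, seed=0):
    """Count steps until the sequence returns to the seed (brute force, small moduli only).

    Returns None if the seed is not revisited within `modulus` steps.
    """
    value = seed
    for steps in range(1, modulus + 1):
        value = (multiplier * value + increment) % modulus
        if value == seed:
            return steps
    return None


print(has_full_period_pow2(1597, 51749))
print(cycle_length(1597, 51749, 2 ** 16))
```

A multiplier such as `1599` fails the second condition, and the brute-force check then reports a cycle shorter than the modulus.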

### Sharding: Dividing the Work

Let's say you want to scan a /16 network *(65,536 IPs)* using 4 machines. Here's how PyLCG handles it:

1. **Division**: First, it divides the total IPs evenly:
   - 65,536 ÷ 4 = 16,384 IPs per shard
   - Machine 1: IPs 0-16,383
   - Machine 2: IPs 16,384-32,767
   - Machine 3: IPs 32,768-49,151
   - Machine 4: IPs 49,152-65,535

2. **Randomization**: Within each shard, IPs are randomized using the LCG:
   - Each IP index in the shard's slice of the full range *(0 to 65,535)* is fed through the LCG
   - The resulting numbers determine the scan order
   - Because we use the same seed, this order is consistent across runs

Example of how IPs might be ordered in Shard 1 *(the values are illustrative, not real LCG output)*:
```
Original order: 0, 1, 2, 3...
LCG values:     51749, 134238, 297019, 12983...
Final order:    3, 0, 1, 2... (sorted by LCG values)
```
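
The two steps can be sketched as follows. This is an illustration under stated assumptions, not PyLCG's actual implementation: `shard_bounds` and `shard_order` are hypothetical names, the sort key for each index is assumed to be the next successive LCG output from the seed, and any remainder from an uneven division is assumed to go to the first shards.

```python
MULTIPLIER, INCREMENT, MODULUS = 1597, 51749, 2 ** 32


def shard_bounds(total, shard_count, shard_index):
    """Return the half-open [start, end) index range for one shard.

    Any remainder is spread over the first shards so sizes differ by at most one.
    """
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index out of range")
    base, extra = divmod(total, shard_count)
    start = shard_index * base + min(shard_index, extra)
    end = start + base + (1 if shard_index < extra else 0)
    return start, end


def shard_order(total, shard_count, shard_index, seed=0):
    """Return the shard's indices sorted by successive LCG outputs (assumed keying scheme)."""
    start, end = shard_bounds(total, shard_count, shard_index)
    keys = {}
    value = seed
    for index in range(start, end):
        value = (MULTIPLIER * value + INCREMENT) % MODULUS
        keys[index] = value
    # The LCG has full period, so keys are distinct and the order is unambiguous.
    return sorted(keys, key=keys.get)


machine_2 = shard_order(65_536, 4, 1)  # a permutation of indices 16384..32767
```

Note that this version holds a whole shard's keys in memory, which is fine for a /16 but is exactly what a memory-efficient design avoids for larger ranges.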

### Memory-Efficient Processing

To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:

1. **Chunked Processing**
   Instead of loading all IPs at once, it processes them in chunks:
   ```python
   # Example with chunk_size = 1000
   chunk_size = 1000
   total_ips = 65_536  # a /16, as in the sharding example
   for start in range(0, total_ips, chunk_size):
       end = min(start + chunk_size, total_ips) - 1
       print(f"Process IPs {start}-{end}")  # 0-999, 1000-1999, ...and so on
   ```

2. **Lazy Generation**
   - IPs are generated only when needed using Python's async generators
   - The system yields one IP at a time rather than creating huge lists
   - This keeps memory usage constant regardless of IP range size

3. **Direct Calculation**
   - The LCG can jump directly to any position in its sequence, computing the n-th value in a number of steps proportional to log n
   - No need to generate all previous numbers
   - Enables efficient random access to any part of the sequence
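
Lazy generation and direct calculation fit together naturally: jump to a saved position, then yield values one at a time. The sketch below uses the standard skip-ahead technique of repeatedly squaring the step function; `lcg_jump`, `lcg_stream`, and `collect` are hypothetical names, and the choice to yield control to the event loop once per chunk is an assumption about how chunking and async might interact.

```python
import asyncio

MULTIPLIER, INCREMENT, MODULUS = 1597, 51749, 2 ** 32


def lcg_jump(seed, steps, a=MULTIPLIER, c=INCREMENT, m=MODULUS):
    """Return the value `steps` iterations after `seed` in O(log steps) time."""
    acc_mult, acc_inc = 1, 0  # identity map: x -> x
    cur_mult, cur_inc = a, c  # map for one step, squared each round
    while steps > 0:
        if steps & 1:
            # Compose the accumulated map with the current power-of-two jump.
            acc_mult = (acc_mult * cur_mult) % m
            acc_inc = (acc_inc * cur_mult + cur_inc) % m
        # Square the current map: applying it twice doubles the jump length.
        cur_inc = ((cur_mult + 1) * cur_inc) % m
        cur_mult = (cur_mult * cur_mult) % m
        steps >>= 1
    return (acc_mult * seed + acc_inc) % m


async def lcg_stream(seed, start=0, count=None, chunk_size=1000):
    """Lazily yield LCG values, skipping the first `start` values of the sequence."""
    value = lcg_jump(seed, start)
    produced = 0
    while count is None or produced < count:
        value = (MULTIPLIER * value + INCREMENT) % MODULUS
        yield value
        produced += 1
        if produced % chunk_size == 0:
            await asyncio.sleep(0)  # let other tasks run between chunks


async def collect(seed, start, count):
    """Gather a slice of the stream into a list (for demonstration only)."""
    return [value async for value in lcg_stream(seed, start=start, count=count)]


full = asyncio.run(collect(0, 0, 10))
resumed = asyncio.run(collect(0, 6, 4))  # same as full[6:]
```

Only the current value and a counter are kept in memory, so a scanner that records how many values it has consumed can restart from that position without replaying the sequence.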

___

## Real-World Applications

### Network Security Testing

Imagine you're testing the security of a large corporate network:
- You have 5 scanning machines
- You need to scan 1 million IPs
- You want to avoid triggering IDS/IPS alerts

PyLCG helps by:
1. Dividing the IPs evenly across your 5 machines *(200,000 each)*
2. Randomizing the scan order so probes don't arrive in sequence, which makes detection less likely
3. Allowing you to pause/resume scans from any point
4. Using minimal memory on each machine

### Cloud-Based Scanning

In cloud environments, PyLCG is particularly useful:
- Easily scale up/down the number of scanning instances
- Each instance knows exactly which IPs to scan
- Consistent results across multiple runs
- Efficient resource usage keeps costs down

___

###### Mirrors for this repository: [acid.vegas](https://git.acid.vegas/pylcg) • [SuperNETs](https://git.supernets.org/acidvegas/pylcg) • [GitHub](https://github.com/acidvegas/pylcg) • [GitLab](https://gitlab.com/acidvegas/pylcg) • [Codeberg](https://codeberg.org/acidvegas/pylcg)