Linear Congruential Generator for IP Sharding

acidvegas 9eadeb54b3 Greatly improved LCG math and code		2024-11-26 00:09:16 -05:00

PyLCG

PyLCG is a Python implementation of a memory-efficient IP address sharding system that uses a Linear Congruential Generator (LCG) for deterministic pseudo-random number generation. It supports distributed scanning & network reconnaissance by dividing IP ranges across multiple machines while visiting addresses in a pseudo-random order.

Overview
How It Works
Real-World Applications
- Network Security Testing
- Cloud-Based Scanning

Overview

When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:

You want to ensure each machine works on a different part of the network (no overlap)
You want to avoid scanning IPs in sequence (which can trigger security alerts)
You need a way to resume scans if a machine fails
You can't load millions of IPs into memory at once

PyLCG solves these challenges through clever mathematics & efficient algorithms.

How It Works

Understanding IP Addresses

First, let's understand how IP addresses work in our system:

An IP address like 192.168.1.1 is really just a 32-bit number: 3232235777 in decimal, or 0xC0A80101 in hexadecimal
A CIDR range like 192.168.0.0/16 represents a continuous range of these numbers
- For example, 192.168.0.0/16 includes all IPs from 192.168.0.0 to 192.168.255.255 (65,536 addresses)
- As 32-bit numbers, the range runs from 0xC0A80000 to 0xC0A8FFFF in hexadecimal, or from 3232235520 to 3232301055 in decimal
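
The conversions above can be checked with Python's standard `ipaddress` module. This is only an illustrative sketch; the variable names are not taken from `pylcg.py`.

```python
import ipaddress

# Convert a dotted-quad address to its 32-bit integer form and back.
ip_int = int(ipaddress.IPv4Address("192.168.1.1"))
print(ip_int, hex(ip_int))            # 3232235777 0xc0a80101
print(ipaddress.IPv4Address(ip_int))  # 192.168.1.1

# A CIDR block is a contiguous run of integers.
network = ipaddress.ip_network("192.168.0.0/16")
first = int(network.network_address)    # lowest address in the block
last = int(network.broadcast_address)   # highest address in the block
print(first, last, network.num_addresses)  # 3232235520 3232301055 65536
```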

The Magic of Linear Congruential Generators

At the heart of PyLCG is something called a Linear Congruential Generator (LCG). Think of it as a mathematical recipe that generates a sequence of numbers that appear random but are actually predictable if you know the starting point (seed).

Here's how it works:

Start with a number (called the seed, which can be random)
Multiply it by 1664525 & add 1013904223
Take the remainder when divided by 2^32 (the modulo operation)
Repeat the process to continue the sequence

Mathematical notation:

Next_Number = (1664525 * Current_Number + 1013904223) mod 2^32
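
The recurrence above translates almost directly into Python. The following is a minimal sketch; the names `lcg_next` and `lcg_sequence` are chosen for illustration and are not necessarily those used in `pylcg.py`.

```python
MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2**32

def lcg_next(current: int) -> int:
    """Advance the generator by one step."""
    return (MULTIPLIER * current + INCREMENT) % MODULUS

def lcg_sequence(seed: int, count: int) -> list[int]:
    """Return the first `count` values produced after `seed`."""
    values = []
    state = seed
    for _ in range(count):
        state = lcg_next(state)
        values.append(state)
    return values

# The same seed always yields the same sequence.
print(lcg_sequence(0, 2))  # [1013904223, 1196435762]
```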

Why these specific numbers?

The numbers 1664525 and 1013904223 are the multiplier and increment values of the LCG. This specific combination was popularized by "Numerical Recipes in C" as a quick generator for a modulus of 2^32. With that modulus the pair satisfies the Hull-Dobell conditions (the increment is odd & the multiplier minus one is divisible by 4), so the sequence visits every 32-bit value exactly once before repeating.
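
One property worth checking for any multiplier and increment pair is whether the generator has a full period. For a power-of-two modulus of at least 4, the Hull-Dobell theorem reduces to two conditions: the increment is odd and the multiplier minus one is divisible by 4. The sketch below tests those conditions and confirms them by brute force on a small modulus; both helper names are hypothetical and not part of PyLCG.

```python
def has_full_period_pow2(multiplier: int, increment: int) -> bool:
    """Hull-Dobell conditions for a power-of-two modulus >= 4."""
    return increment % 2 == 1 and (multiplier - 1) % 4 == 0

def period(multiplier: int, increment: int, modulus: int, seed: int = 0):
    """Brute-force cycle length from `seed`; only practical for small moduli."""
    state = seed
    for steps in range(1, modulus + 1):
        state = (multiplier * state + increment) % modulus
        if state == seed:
            return steps
    return None  # the seed never recurs, so it is not on a cycle

# The constants pass the check, and reducing them mod 256 gives a full cycle.
print(has_full_period_pow2(1664525, 1013904223))  # True
print(period(1664525, 1013904223, 256))           # 256

# A multiplier of 3 fails the check, so its cycle is shorter than the modulus.
print(has_full_period_pow2(3, 1))                 # False
```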

Sharding: Dividing the Work

PyLCG uses an interleaved sharding approach to ensure truly distributed scanning. Here's how it works:

Interleaved Distribution: Instead of dividing the IP range into sequential blocks, PyLCG distributes IPs across shards using an offset pattern:
- For 4 shards scanning a network:
  - Shard 0 handles IPs at indices: 0, 4, 8, 12, ...
  - Shard 1 handles IPs at indices: 1, 5, 9, 13, ...
  - Shard 2 handles IPs at indices: 2, 6, 10, 14, ...
  - Shard 3 handles IPs at indices: 3, 7, 11, 15, ...
Randomization: Within each shard, the LCG randomizes the order of IPs:
- Each index is fed through the LCG to generate a random value
- IPs are scanned in order of these random values
- The same seed ensures consistent ordering across runs

This approach ensures:

Even distribution across the entire IP space
No sequential scanning patterns that could trigger alerts
Balanced work across shards (shard sizes differ by at most one IP)
Deterministic results that can be reproduced
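
The interleaving and ordering steps described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `pylcg.py`: the function names are hypothetical, the way the seed is mixed into the index is a guess, and sorting holds the whole shard in memory, which is only acceptable for a small example.

```python
import ipaddress

MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2**32

def lcg_value(index: int, seed: int) -> int:
    """Map an index to a pseudo-random sort key (assumed seed mixing)."""
    return (MULTIPLIER * ((index + seed) % MODULUS) + INCREMENT) % MODULUS

def shard_ips(cidr: str, shard: int, total_shards: int, seed: int) -> list[str]:
    """Return this shard's IPs, ordered by their LCG-derived keys."""
    network = ipaddress.ip_network(cidr)
    base = int(network.network_address)
    # Interleaving: the shard owns indices shard, shard + total_shards, ...
    indices = range(shard, network.num_addresses, total_shards)
    # Sorting materializes the shard in memory; fine for a small illustration.
    ordered = sorted(indices, key=lambda i: lcg_value(i, seed))
    return [str(ipaddress.IPv4Address(base + i)) for i in ordered]

# Four shards over a /28 (16 addresses): each shard receives 4 distinct IPs.
for shard in range(4):
    print(shard, shard_ips("192.168.0.0/28", shard, 4, seed=12345))
```

Running every shard with the same seed covers each address exactly once, and rerunning a shard reproduces the same order.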

Memory-Efficient Processing

To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:

Chunked Processing
- Instead of loading all IPs at once, it processes them in chunks
Lazy Generation
- IPs are generated only when needed using Python's async generators
- The system yields one IP at a time rather than creating huge lists
- This keeps memory usage constant regardless of IP range size
Direct Calculation
- The LCG can jump directly to any position in its sequence
- No need to generate all previous numbers
- Enables efficient random access to any part of the sequence
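
Direct calculation and lazy generation can be combined in a short sketch. Because each LCG step is an affine map, n steps can be composed by repeated squaring in O(log n) time. The names `lcg_jump`, `lazy_values`, and `collect` are hypothetical; PyLCG's own implementation may differ.

```python
import asyncio

MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2**32

def lcg_next(state: int) -> int:
    """Advance the generator by one step."""
    return (MULTIPLIER * state + INCREMENT) % MODULUS

def lcg_jump(state: int, steps: int) -> int:
    """Return the state reached after `steps` steps without iterating them."""
    acc_mult, acc_inc = 1, 0                   # identity map: x -> x
    cur_mult, cur_inc = MULTIPLIER, INCREMENT  # map for 2**k steps, k = 0
    while steps > 0:
        if steps & 1:
            # Compose the accumulated map with the current 2**k-step map.
            acc_mult = (acc_mult * cur_mult) % MODULUS
            acc_inc = (acc_inc * cur_mult + cur_inc) % MODULUS
        # Square the current map so it covers 2**(k+1) steps.
        cur_inc = ((cur_mult + 1) * cur_inc) % MODULUS
        cur_mult = (cur_mult * cur_mult) % MODULUS
        steps >>= 1
    return (acc_mult * state + acc_inc) % MODULUS

async def lazy_values(seed: int, start: int, count: int):
    """Yield `count` values one at a time, beginning at position `start`."""
    state = lcg_jump(seed, start)
    for _ in range(count):
        yield state
        state = lcg_next(state)

async def collect(seed: int, start: int, count: int) -> list[int]:
    """Gather the generator's output into a list (for demonstration only)."""
    return [value async for value in lazy_values(seed, start, count)]

# Resume at position 1,000,000 without computing the values before it.
print(asyncio.run(collect(42, 1_000_000, 3)))
```

Memory use stays constant because only the current state is held at any time.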

Roadmap

Add support for IPv6
Add support for custom LCG parameters like adding port numbers
Add support for custom chunk sizes & auto-tuning based on available system resources
Add support for resuming from a specific point in the sequence
Add support for saving the state of the LCG to a file so you can resume later
Add support for sharding line-based input files locally, from an S3 bucket, or from a URL by reading them in chunks.
Update the unit tests to include benchmarks & better coverage for future efficiency improvements & validation.