PyLCG

Linear Congruential Generator for IP Sharding

PyLCG is a Python implementation of a memory-efficient IP address sharding system that uses a Linear Congruential Generator (LCG) for deterministic pseudo-random number generation. It aids distributed scanning & network reconnaissance by dividing IP ranges across multiple machines while visiting the addresses in a pseudo-random order.



Overview

When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:

  1. You want to ensure each machine works on a different part of the network (no overlap)
  2. You want to avoid scanning IPs in sequence (which can trigger security alerts)
  3. You need a way to resume scans if a machine fails
  4. You can't load millions of IPs into memory at once

PyLCG addresses these challenges by combining interleaved sharding, an LCG-driven pseudo-random ordering & lazy, chunked generation of IPs.


How It Works

Understanding IP Addresses

First, let's understand how IP addresses work in our system:

  • An IP address like 192.168.1.1 is really just a 32-bit number equal to 3232235777 or 0xC0A80101 in hexadecimal
  • A CIDR range like 192.168.0.0/16 represents a continuous range of these numbers
    • For example, 192.168.0.0/16 includes all IPs from 192.168.0.0 to 192.168.255.255 (65,536 addresses)
    • As 32-bit numbers, the range runs from 0xC0A80000 to 0xC0A8FFFF in hexadecimal, or from 3232235520 to 3232301055 in decimal
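
The conversions above can be checked with Python's standard `ipaddress` module. This is only an illustration of the arithmetic, not necessarily how PyLCG handles addresses internally:

```python
import ipaddress

# Convert a dotted-quad address to its 32-bit integer form and back.
ip_int = int(ipaddress.IPv4Address("192.168.1.1"))
print(ip_int, hex(ip_int))            # 3232235777 0xc0a80101
print(ipaddress.IPv4Address(ip_int))  # 192.168.1.1

# A CIDR block is a contiguous run of integers.
network = ipaddress.ip_network("192.168.0.0/16")
first = int(network.network_address)
last = int(network.broadcast_address)
print(first, last, network.num_addresses)  # 3232235520 3232301055 65536
```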

The Magic of Linear Congruential Generators

At the heart of PyLCG is something called a Linear Congruential Generator (LCG). Think of it as a mathematical recipe that generates a sequence of numbers that appear random but are actually predictable if you know the starting point (seed).

Here's how it works:

  1. Start with a number (called the seed, which can be random)
  2. Multiply it by 1664525 & add 1013904223
  3. Take the remainder when divided by 2^32 (the modulo operation)
  4. Repeat the process to continue the sequence

Mathematical notation:

    Next_Number = (1664525 * Current_Number + 1013904223) mod 2^32

Why these specific numbers?

The numbers 1664525 and 1013904223 are the multiplier and increment of the LCG. This specific combination was popularized by "Numerical Recipes in C". With a modulus of 2^32, an odd increment and a multiplier that leaves a remainder of 1 when divided by 4 give the generator its full period: every 32-bit value appears exactly once before the sequence repeats.
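
As a minimal sketch of the recurrence, the step function below applies the formula once per call. The names `lcg_next`, `lcg_sequence` & `has_full_period` are illustrative and not taken from PyLCG's source; the period check uses the standard conditions for a power-of-two modulus (odd increment, multiplier minus one divisible by 4).

```python
MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2**32


def lcg_next(state: int) -> int:
    """Advance the LCG by one step."""
    return (MULTIPLIER * state + INCREMENT) % MODULUS


def lcg_sequence(seed: int, count: int) -> list[int]:
    """Return the first `count` values produced after `seed`."""
    values = []
    state = seed % MODULUS
    for _ in range(count):
        state = lcg_next(state)
        values.append(state)
    return values


def has_full_period() -> bool:
    """Check the full-period conditions for a power-of-two modulus."""
    return INCREMENT % 2 == 1 and (MULTIPLIER - 1) % 4 == 0


print(lcg_next(0))         # 1013904223
print(has_full_period())   # True
print(lcg_sequence(12345, 3) == lcg_sequence(12345, 3))  # True: same seed, same sequence
```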

Sharding: Dividing the Work

PyLCG uses an interleaved sharding approach to ensure truly distributed scanning. Here's how it works:

  1. Interleaved Distribution: Instead of dividing the IP range into sequential blocks, PyLCG distributes IPs across shards using an offset pattern:

    • For 4 shards scanning a network:
      • Shard 0 handles IPs at indices: 0, 4, 8, 12, ...
      • Shard 1 handles IPs at indices: 1, 5, 9, 13, ...
      • Shard 2 handles IPs at indices: 2, 6, 10, 14, ...
      • Shard 3 handles IPs at indices: 3, 7, 11, 15, ...
  2. Randomization: Within each shard, the LCG randomizes the order of IPs:

    • Each index is fed through the LCG to generate a random value
    • IPs are scanned in order of these random values
    • The same seed ensures consistent ordering across runs
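
The two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not PyLCG's actual code: the function names are hypothetical, the sort key is a single LCG step applied to `seed + index`, and the whole shard is sorted in memory, which only suits small ranges.

```python
import ipaddress

MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2**32


def lcg_value(index: int, seed: int) -> int:
    """Map an index to a pseudo-random sort key (assumed scheme: one LCG step)."""
    return (MULTIPLIER * ((seed + index) % MODULUS) + INCREMENT) % MODULUS


def shard_ips(cidr: str, shard_num: int, total_shards: int, seed: int) -> list[str]:
    """Return the IPs of one shard, ordered by their LCG-derived keys."""
    network = ipaddress.ip_network(cidr)
    base = int(network.network_address)
    # Interleaved indices: shard_num, shard_num + total_shards, ...
    indices = range(shard_num, network.num_addresses, total_shards)
    # Same seed -> same keys -> same order on every run.
    ordered = sorted(indices, key=lambda i: lcg_value(i, seed))
    return [str(ipaddress.IPv4Address(base + i)) for i in ordered]


# Four shards over a small /28 network (16 addresses).
shards = [shard_ips("192.168.0.0/28", n, 4, seed=12345) for n in range(4)]
for number, ips in enumerate(shards):
    print(number, ips)
```

Every address lands in exactly one shard because the index sets are disjoint and together cover the whole range.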

This approach ensures:

  • Even distribution across the entire IP space
  • No sequential scanning patterns that could trigger alerts
  • Balanced distribution of work across shards (shard sizes differ by at most one IP)
  • Deterministic results that can be reproduced

Memory-Efficient Processing

To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:

  1. Chunked Processing

    • Instead of loading all IPs at once, PyLCG processes them in chunks

  2. Lazy Generation

    • IPs are generated only when needed using Python's async generators
    • The system yields one IP at a time rather than creating huge lists
    • This keeps memory usage constant regardless of IP range size
  3. Direct Calculation

    • The LCG can jump directly to any position in its sequence
    • No need to generate all previous numbers
    • Enables efficient random access to any part of the sequence
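
A minimal sketch of direct calculation combined with lazy generation, assuming the same constants as above. `lcg_jump` & `lcg_stream` are hypothetical names, not PyLCG's API. The jump works because `n` LCG steps compose into a single affine map, which can be built by repeated squaring in about log2(n) iterations.

```python
import asyncio

MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2**32


def lcg_next(state: int) -> int:
    """Advance the LCG by one step."""
    return (MULTIPLIER * state + INCREMENT) % MODULUS


def lcg_jump(state: int, steps: int) -> int:
    """Return the state reached after `steps` LCG steps without iterating them."""
    a, c = MULTIPLIER, INCREMENT   # affine map for 2^k steps: x -> a*x + c
    acc_mult, acc_inc = 1, 0       # accumulated map, starts as the identity
    while steps > 0:
        if steps & 1:
            acc_mult = (acc_mult * a) % MODULUS
            acc_inc = (acc_inc * a + c) % MODULUS
        c = ((a + 1) * c) % MODULUS  # compose the map with itself
        a = (a * a) % MODULUS
        steps >>= 1
    return (acc_mult * state + acc_inc) % MODULUS


async def lcg_stream(seed: int, start: int, count: int):
    """Lazily yield `count` values, beginning right after position `start`."""
    state = lcg_jump(seed % MODULUS, start)
    for _ in range(count):
        state = lcg_next(state)
        yield state  # one value at a time; no list is built


async def main() -> None:
    # Resume a sequence one million steps in, holding only one state in memory.
    async for value in lcg_stream(seed=12345, start=1_000_000, count=3):
        print(value)


asyncio.run(main())
```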

Roadmap

  • Add support for IPv6
  • Add support for custom LCG parameters like adding port numbers
  • Add support for custom chunk sizes & auto-tuning based on available system resources
  • Add support for resuming from a specific point in the sequence
  • Add support for saving the state of the LCG to a file so you can resume later
  • Add support for sharding line-based input files locally, from an S3 bucket, or from a URL by reading them in chunks
  • Update the unit tests to include benchmarks & better coverage for future efficiency improvements & validation.

Mirrors for this repository: acid.vegas • SuperNETs • GitHub • GitLab • Codeberg