Linear Congruential Generator for IP Sharding
pylcg.py Initial commit 2024-11-25 22:28:06 -05:00
README.md Initial commit 2024-11-25 22:28:06 -05:00
unit_test.py Initial commit 2024-11-25 22:28:06 -05:00

PyLCG

Linear Congruential Generator for IP Sharding

PyLCG is a Python implementation of a memory-efficient IP address sharding system that uses a Linear Congruential Generator (LCG) for deterministic pseudo-random number generation. It supports distributed scanning and network reconnaissance by dividing IP ranges across multiple machines.


Project Origins & Purpose

PyLCG was inspired by the elegant IP distribution system used in masscan, the popular mass IP port scanner. While masscan implements this logic as part of its larger codebase, I wanted to isolate and implement this specific component as a standalone Python library that developers can easily integrate into their own projects.

The goal was to create a clean, well-documented implementation that:

  • Can be used as a drop-in solution for any project needing IP distribution capabilities
  • Provides the same reliable mathematical foundation as masscan's approach
  • Is easy to understand and modify for specific needs
  • Works well with modern Python async patterns

By extracting this functionality into its own library, developers can add sophisticated IP distribution capabilities to their network tools without having to reinvent the wheel or extract code from larger projects.


Overview

When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:

  1. You want to ensure each machine works on a different part of the network (no overlap)
  2. You want to avoid scanning IPs in sequence (which can trigger security alerts)
  3. You need a way to resume scans if a machine fails
  4. You can't load millions of IPs into memory at once

PyLCG addresses these challenges by combining deterministic sharding, LCG-based ordering, and lazy, chunked IP generation.


How It Works

Understanding IP Addresses

First, let's understand how IP addresses work in our system:

  • An IPv4 address like 192.168.1.1 is really just a 32-bit number (3,232,235,777 in decimal)
  • A CIDR range like 192.168.0.0/16 represents a contiguous range of these numbers
  • For example, 192.168.0.0/16 includes all IPs from 192.168.0.0 to 192.168.255.255 (65,536 addresses)
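The address-to-integer mapping described above can be tried directly with Python's standard `ipaddress` module. This is a minimal sketch; the helper name `ip_at` is hypothetical and is not part of PyLCG.

```python
import ipaddress

# An IPv4 address is a 32-bit integer behind the dotted-quad notation.
addr = ipaddress.IPv4Address("192.168.1.1")
addr_as_int = int(addr)                          # 3232235777
round_trip = ipaddress.IPv4Address(addr_as_int)  # back to 192.168.1.1

# A CIDR block is a contiguous run of those integers.
net = ipaddress.ip_network("192.168.0.0/16")
first = int(net.network_address)    # integer form of 192.168.0.0
last = int(net.broadcast_address)   # integer form of 192.168.255.255
total = net.num_addresses           # 65536


def ip_at(index: int) -> str:
    """Return the address at a 0-based offset inside the block (hypothetical helper)."""
    if not 0 <= index < total:
        raise IndexError(index)
    return str(ipaddress.IPv4Address(first + index))
```

Working with offsets such as `ip_at(0)` or `ip_at(65535)` means a range can be described by two integers instead of a list of addresses.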

The Magic of Linear Congruential Generators

At the heart of PyLCG is something called a Linear Congruential Generator (LCG). Think of it as a mathematical recipe that generates a sequence of numbers that appear random but are actually predictable if you know the starting point (seed).

Here's how it works:

  1. Start with a number (called the seed)
  2. Multiply it by a carefully chosen constant (1597 in our case)
  3. Add another carefully chosen constant (51749)
  4. Take the remainder when divided by 2^32
  5. That's your next number! Repeat the process to get more numbers

In mathematical notation:

Next_Number = (1597 * Current_Number + 51749) mod 2^32
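As a minimal sketch of the recurrence above, using the constants given in the text (the names `lcg_next` and `lcg_sequence` are illustrative and are not taken from PyLCG's API):

```python
A = 1597        # multiplier from the text
C = 51749       # increment from the text
M = 2 ** 32     # modulus from the text


def lcg_next(current: int) -> int:
    """Apply one LCG step: multiply, add, reduce modulo 2^32."""
    return (A * current + C) % M


def lcg_sequence(seed: int, count: int) -> list[int]:
    """Return the first `count` values produced after `seed`."""
    values = []
    x = seed
    for _ in range(count):
        x = lcg_next(x)
        values.append(x)
    return values
```

With a seed of 0, the first value is 51749 (the increment alone) and the second is 1597 * 51749 + 51749 = 82,694,902. The same seed always yields the same sequence.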

Why these specific numbers?

  • 1597 and 51749 were chosen because they create a sequence that:
    • Visits every possible number before repeating (maximum period)
    • Spreads numbers evenly across the range
    • Can be calculated quickly on computers
  • 2^32 (4,294,967,296) is used because it:
    • Matches the size of the IPv4 address space (one value per 32-bit address)
    • Is large enough to handle any IPv4 range
    • Makes calculations efficient on modern CPUs
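The maximum-period claim can be checked against the Hull-Dobell theorem, which says an LCG with a non-zero increment has full period exactly when the increment is coprime to the modulus, the multiplier minus one is divisible by every prime factor of the modulus, and it is also divisible by 4 whenever the modulus is. The sketch below tests those conditions and, for a small modulus, measures the period by brute force. The function names are hypothetical.

```python
from math import gcd


def prime_factors(n: int) -> set[int]:
    """Return the distinct prime factors of n by trial division."""
    factors = set()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def has_full_period(a: int, c: int, m: int) -> bool:
    """Check the three Hull-Dobell conditions for x -> (a*x + c) mod m."""
    if gcd(c, m) != 1:
        return False
    if any((a - 1) % p != 0 for p in prime_factors(m)):
        return False
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False
    return True


def measured_period(a: int, c: int, m: int, seed: int = 0) -> int:
    """Count steps until the sequence returns to `seed` (small m only)."""
    x = seed
    for steps in range(1, m + 1):
        x = (a * x + c) % m
        if x == seed:
            return steps
    raise ValueError("seed is not on a cycle")
```

For the constants in the text, 51749 is odd and 1596 = 4 * 399, so `has_full_period(1597, 51749, 2**32)` holds. The same constants also give a full period for smaller powers of two, which `measured_period` can confirm without walking all 2^32 states.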

Sharding: Dividing the Work

Let's say you want to scan a /16 network (65,536 IPs) using 4 machines. Here's how PyLCG handles it:

  1. Division: First, it divides the total IPs evenly:

    • 65,536 ÷ 4 = 16,384 IPs per shard
    • Machine 1: IPs 0-16,383
    • Machine 2: IPs 16,384-32,767
    • Machine 3: IPs 32,768-49,151
    • Machine 4: IPs 49,152-65,535
  2. Randomization: Within each shard, IPs are randomized using the LCG:

    • Each IP index in the shard (0 to 16,383 for Machine 1) is fed through the LCG
    • The resulting numbers determine the scan order
    • Because we use the same seed, this order is consistent across runs
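Step 1 (division) can be sketched as a small helper that returns the index range for one shard. The name `shard_bounds` is hypothetical, shards are numbered from 0 here rather than from "Machine 1", and the handling of totals that do not divide evenly (earlier shards take one extra index) is an assumption rather than documented PyLCG behavior.

```python
def shard_bounds(total: int, shard_index: int, total_shards: int) -> range:
    """Return the contiguous index range owned by a 0-based shard."""
    if total_shards <= 0 or not 0 <= shard_index < total_shards:
        raise ValueError("shard_index must be in [0, total_shards)")
    base, extra = divmod(total, total_shards)
    # The first `extra` shards each take one additional index.
    start = shard_index * base + min(shard_index, extra)
    stop = start + base + (1 if shard_index < extra else 0)
    return range(start, stop)
```

For the /16 example, `shard_bounds(65536, 0, 4)` covers indices 0 to 16,383 and `shard_bounds(65536, 3, 4)` covers 49,152 to 65,535, matching the table above.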

Example of how IPs might be ordered in Shard 1 (the values are illustrative, not actual LCG output):

Original order: 0, 1, 2, 3...
LCG values:     51749, 134238, 297019, 12983...
Final order:    3, 0, 1, 2...  (indices sorted by their LCG values)
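One way to read the randomization step is sketched below: draw one LCG value per index starting from a seed, then sort the indices by those values. This is an interpretation of the description, not a copy of PyLCG's code, and `shuffled_order` is a hypothetical name. Note that it holds the whole shard in memory, so it illustrates the ordering idea rather than the memory-efficient path.

```python
A, C, M = 1597, 51749, 2 ** 32   # constants from the text


def shuffled_order(indices: range, seed: int) -> list[int]:
    """Return `indices` reordered by the LCG value assigned to each one."""
    keys = []
    x = seed
    for _ in indices:
        x = (A * x + C) % M      # next value in the LCG sequence
        keys.append(x)
    # Pair each key with its index and sort by key.
    return [index for _, index in sorted(zip(keys, indices))]
```

Because the keys depend only on the seed and the number of indices, two machines given the same shard and seed compute the same order.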

Memory-Efficient Processing

To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:

  1. Chunked Processing: Instead of loading all IPs at once, it processes them in chunks:

    # Example with chunk_size = 1000
    Chunk 1: Process IPs 0-999
    Chunk 2: Process IPs 1000-1999
    ...and so on
    
  2. Lazy Generation

    • IPs are generated only when needed using Python's async generators
    • The system yields one IP at a time rather than creating huge lists
    • This keeps memory usage constant regardless of IP range size
  3. Direct Calculation

    • The LCG can jump directly to any position in its sequence
    • No need to generate all previous numbers
    • Enables efficient random access to any part of the sequence
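The direct-calculation property follows from the fact that applying an LCG step n times is itself a single "multiply and add" map, which can be built by repeated squaring in O(log n) steps. The sketch below shows that standard technique with the constants from the text; `lcg_jump` is a hypothetical name and is not taken from PyLCG.

```python
A, C, M = 1597, 51749, 2 ** 32   # constants from the text


def lcg_jump(seed: int, steps: int) -> int:
    """Return the LCG state after `steps` updates without iterating them all."""
    acc_mult, acc_add = 1, 0     # accumulated map, starts as x -> x
    cur_mult, cur_add = A, C     # map for 2**k steps, starting with k = 0
    while steps > 0:
        if steps & 1:
            # Compose the accumulated map with the current 2**k-step map.
            acc_mult = (acc_mult * cur_mult) % M
            acc_add = (acc_add * cur_mult + cur_add) % M
        # Square the current map: 2**k steps become 2**(k+1) steps.
        cur_add = (cur_add * (cur_mult + 1)) % M
        cur_mult = (cur_mult * cur_mult) % M
        steps >>= 1
    return (acc_mult * seed + acc_add) % M
```

A scan that stopped after n values can therefore be resumed by computing `lcg_jump(seed, n)` instead of replaying the first n steps.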

Real-World Applications

Network Security Testing

Imagine you're testing the security of a large corporate network:

  • You have 5 scanning machines
  • You need to scan 1 million IPs
  • You want to reduce the chance of triggering IDS/IPS alerts

PyLCG helps by:

  1. Dividing the IPs evenly across your 5 machines
  2. Randomizing the scan order so the traffic does not look like a sequential sweep
  3. Allowing you to pause/resume scans from any point
  4. Using minimal memory on each machine
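The four points above can be combined into one lazy async generator. The sketch below is an illustration under stated assumptions, not PyLCG's implementation: `shard_stream` and its parameters are hypothetical, shards are 0-based, and instead of the 2^32 modulus it runs the LCG over the smallest power of two that covers the shard and skips out-of-range values, so each offset appears exactly once without storing or sorting anything.

```python
import asyncio
import ipaddress
from typing import AsyncIterator, Optional, Tuple

A, C = 1597, 51749   # multiplier and increment from the text


async def shard_stream(
    cidr: str,
    shard_index: int,
    total_shards: int,
    seed: int = 0,
    resume_state: Optional[int] = None,
) -> AsyncIterator[Tuple[str, int]]:
    """Lazily yield (ip, state) pairs for one shard in LCG order."""
    net = ipaddress.ip_network(cidr)
    base, extra = divmod(net.num_addresses, total_shards)
    start = shard_index * base + min(shard_index, extra)
    size = base + (1 if shard_index < extra else 0)

    # Assumption: smallest power-of-two modulus that covers the shard,
    # so one full LCG cycle visits every offset exactly once.
    modulus = 1
    while modulus < size:
        modulus *= 2

    origin = seed % modulus
    if resume_state == origin:
        return                       # the previous run already finished
    x = origin if resume_state is None else resume_state
    while True:
        x = (A * x + C) % modulus
        if x < size:                 # skip states outside the shard
            yield str(net[start + x]), x
            await asyncio.sleep(0)   # let other tasks run between IPs
        if x == origin:              # back at the start: cycle complete
            break
```

A caller would iterate with `async for ip, state in shard_stream("10.0.0.0/8", 0, 5)`, persist the latest `state`, and pass it back as `resume_state` after a failure. Memory use stays constant because only the current state is kept.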

Cloud-Based Scanning

In cloud environments, PyLCG is particularly useful:

  • Easily scale up/down the number of scanning instances
  • Each instance knows exactly which IPs to scan
  • Consistent results across multiple runs
  • Efficient resource usage keeps costs down

Mirrors for this repository: acid.vegas, SuperNETs, GitHub, GitLab, Codeberg