Initial commit

2024-11-25 22:28:06 -05:00 · commit 080d46ea3d
3 changed files with 467 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,161 @@
 # PyLCG
 > Linear Congruential Generator for IP Sharding
 PyLCG is a Python implementation of a memory-efficient IP address sharding system that uses a Linear Congruential Generator *(LCG)* for deterministic pseudo-random number generation. It supports distributed scanning and network reconnaissance by dividing IP ranges across multiple machines.
 ___
 ## Table of Contents
 - [Project Origins & Purpose](#project-origins-and-purpose)
 - [Overview](#overview)
 - [How It Works](#how-it-works)
    - [Understanding IP Addresses](#understanding-ip-addresses)
    - [The Magic of Linear Congruential Generators](#the-magic-of-linear-congruential-generators)
    - [Sharding: Dividing the Work](#sharding-dividing-the-work)
    - [Memory-Efficient Processing](#memory-efficient-processing)
 - [Real-World Applications](#real-world-applications)
    - [Network Security Testing](#network-security-testing)
    - [Cloud-Based Scanning](#cloud-based-scanning)
 ___
 ## Project Origins & Purpose
 PyLCG was inspired by the elegant IP distribution system used in [masscan](https://github.com/robertdavidgraham/masscan), the popular mass IP port scanner. While masscan implements this logic as part of its larger codebase, I wanted to isolate and implement this specific component as a standalone Python library that developers can easily integrate into their own projects.
 The goal was to create a clean, well-documented implementation that:
 - Can be used as a drop-in solution for any project needing IP distribution capabilities
 - Provides the same reliable mathematical foundation as masscan's approach
 - Is easy to understand and modify for specific needs
 - Works well with modern Python async patterns
 By extracting this functionality into its own library, developers can add sophisticated IP distribution capabilities to their network tools without having to reinvent the wheel or extract code from larger projects.
 ___
 ## Overview
 When performing network reconnaissance or scanning large IP ranges, it's often necessary to split the work across multiple machines. However, this presents several challenges:
 1. You want to ensure each machine works on a different part of the network *(no overlap)*
 2. You want to avoid scanning IPs in sequence *(which can trigger security alerts)*
 3. You need a way to resume scans if a machine fails
 4. You can't load millions of IPs into memory at once
 PyLCG solves these challenges through clever mathematics and efficient algorithms.
 ___
 ## How It Works
 ### Understanding IP Addresses
 First, let's understand how IP addresses work in our system:
 - An IP address like `192.168.1.1` is really just a 32-bit number
 - A CIDR range like `192.168.0.0/16` represents a continuous range of these numbers
 - For example, `192.168.0.0/16` includes all IPs from `192.168.0.0` to `192.168.255.255` *(65,536 addresses)*
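To make the mapping concrete, here is a small sketch using Python's standard `ipaddress` module *(the same module `pylcg.py` imports)*. The helper name `cidr_bounds` is made up for this illustration and is not part of PyLCG:

```python
import ipaddress

def cidr_bounds(cidr: str) -> tuple[int, int, int]:
    '''Return (first, last, count) for a CIDR range as plain integers.'''
    network = ipaddress.ip_network(cidr)
    first   = int(network.network_address)
    last    = int(network.broadcast_address)
    return first, last, last - first + 1

# An IPv4 address is a 32-bit integer underneath the dotted-quad notation
as_int = int(ipaddress.ip_address('192.168.1.1'))
print(as_int)                        # 3232235777
print(ipaddress.ip_address(as_int))  # 192.168.1.1

# A CIDR range is a contiguous run of those integers
first, last, count = cidr_bounds('192.168.0.0/16')
print(ipaddress.ip_address(first), ipaddress.ip_address(last), count)
```

Working with integer indices rather than address strings is what lets the rest of the system address any IP in a range by its offset from the first address.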
 ### The Magic of Linear Congruential Generators
 At the heart of PyLCG is a Linear Congruential Generator *(LCG)*: a mathematical recipe that generates a sequence of numbers that appears random but is fully determined by its starting point *(seed)*.
 Here's how it works:
 1. Start with a number *(called the seed)*
 2. Multiply it by a carefully chosen constant *(1597 in our case)*
 3. Add another carefully chosen constant *(51749)*
 4. Take the remainder when divided by 2^32
 5. That's your next number! Repeat the process to get more numbers
 In mathematical notation:
 ```
 Next_Number = (1597 * Current_Number + 51749) mod 2^32
 ```
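The recurrence above can be sketched in a few lines of Python. The function names `lcg_next` and `lcg_sequence` are illustrative only; the constants are the ones quoted in this section:

```python
A, C, M = 1597, 51749, 2**32  # multiplier, increment, modulus from the formula above

def lcg_next(current: int) -> int:
    '''Apply one step of the recurrence to the current value.'''
    return (A * current + C) % M

def lcg_sequence(seed: int, count: int) -> list[int]:
    '''Return the first `count` values generated after the seed.'''
    values, current = [], seed
    for _ in range(count):
        current = lcg_next(current)
        values.append(current)
    return values

print(lcg_sequence(0, 2))  # [51749, 82694902]
```

Running it twice with the same seed always gives the same list, which is the property the sharding logic relies on.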
 Why these specific numbers?
 - `1597` and `51749` were chosen because they create a sequence that:
  - Visits every value from 0 to 2^32 - 1 exactly once before repeating *(maximum period; the constants satisfy the Hull-Dobell conditions for a modulus of 2^32)*
  - Spreads numbers evenly across the range
  - Can be calculated quickly on computers
 - `2^32` *(4,294,967,296)* is used because it:
  - Matches the size of a 32-bit integer
  - Is large enough to cover the entire IPv4 address space
  - Makes calculations efficient on modern CPUs
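The maximum-period claim can be checked against the Hull-Dobell theorem, which says an LCG with a non-zero increment has period `m` exactly when the increment is coprime to `m`, `a - 1` is divisible by every prime factor of `m`, and `a - 1` is divisible by 4 whenever `m` is. The sketch below is a stand-alone check, not code from PyLCG, and the helper names are hypothetical:

```python
from math import gcd

def prime_factors(n: int) -> set[int]:
    '''Return the distinct prime factors of n by trial division.'''
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def has_full_period(a: int, c: int, m: int) -> bool:
    '''Hull-Dobell test: True when the LCG visits all m values before repeating.'''
    if gcd(c, m) != 1:
        return False  # increment must be coprime to the modulus
    if any((a - 1) % p for p in prime_factors(m)):
        return False  # a - 1 must be divisible by every prime factor of m
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False  # and by 4 when m is a multiple of 4
    return True

print(has_full_period(1597, 51749, 2**32))  # True
```

For a modulus of 2^32 the conditions reduce to "the increment is odd and the multiplier leaves remainder 1 when divided by 4", which holds for `51749` and `1597`.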
 ### Sharding: Dividing the Work
 Let's say you want to scan a /16 network *(65,536 IPs)* using 4 machines. Here's how PyLCG handles it:
 1. **Division**: First, it divides the IP indices evenly:
   - 65,536 ÷ 4 = 16,384 IPs per shard
   - Machine 1: indices 0-16,383
   - Machine 2: indices 16,384-32,767
   - Machine 3: indices 32,768-49,151
   - Machine 4: indices 49,152-65,535
 2. **Randomization**: Within each shard, IPs are shuffled chunk by chunk using the LCG:
   - Each IP index `i` is paired with the `i`-th value of the LCG sequence
   - Each chunk of indices *(1,000 by default)* is sorted by those values to determine the scan order
   - Because we use the same seed, this order is consistent across runs
 Example of how IPs might be ordered in Shard 1 *(illustrative values, not actual LCG output)*:
 ```
 Original order: 0, 1, 2, 3...
 LCG values:    51749, 134238, 297019, 12983...
 Final order:   3, 0, 1, 2...  (sorted by LCG values)
 ```
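The two steps can be sketched as follows. This is a simplified, synchronous approximation of what `get_shard_ips` does; the names `shard_bounds`, `lcg_values` and `shard_order` are hypothetical, and for brevity the sketch keeps the sort keys in a list rather than computing them on demand:

```python
from math import ceil

A, C, M = 1597, 51749, 2**32

def shard_bounds(total_ips: int, shard_num: int, total_shards: int) -> tuple[int, int]:
    '''Return the half-open index range [start, end) owned by a shard.'''
    shard_size = ceil(total_ips / total_shards)
    start      = min(shard_num * shard_size, total_ips)
    end        = min(start + shard_size, total_ips)
    return start, end

def lcg_values(seed: int, count: int) -> list[int]:
    '''First `count` values of the sequence, starting with the seed itself.'''
    values, current = [], seed
    for _ in range(count):
        values.append(current)
        current = (A * current + C) % M
    return values

def shard_order(total_ips, shard_num, total_shards, seed, chunk_size=1000):
    '''Indices of one shard, shuffled chunk by chunk with LCG values as sort keys.'''
    start, end = shard_bounds(total_ips, shard_num, total_shards)
    keys  = lcg_values(seed, end)  # keys[i] is the i-th LCG value
    order = []
    for chunk_start in range(start, end, chunk_size):
        chunk = range(chunk_start, min(chunk_start + chunk_size, end))
        order.extend(sorted(chunk, key=keys.__getitem__))
    return order

order = shard_order(total_ips=65_536, shard_num=0, total_shards=4, seed=12345)
print(len(order), sorted(order) == list(range(16_384)))  # 16384 True
```

Because each shard owns a disjoint index range and sorting only reorders indices inside a chunk, every IP is visited exactly once no matter how many shards are used.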
 ### Memory-Efficient Processing
 To handle large IP ranges without consuming too much memory, PyLCG uses several techniques:
 1. **Chunked Processing**
   Instead of loading all IPs at once, it processes them in chunks:
   ```python
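   # Example with chunk_size = 1000 over a /16 (65,536 IPs)
   total_ips, chunk_size = 65_536, 1000
   for chunk_start in range(0, total_ips, chunk_size):
       chunk_end = min(chunk_start + chunk_size, total_ips)
       # Chunk 1 covers IPs 0-999, chunk 2 covers IPs 1000-1999, and so on
       print(f'Process IPs {chunk_start}-{chunk_end - 1}')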
   ```
 2. **Lazy Generation**
   - IPs are generated only when needed using Python's async generators
   - The system yields one IP at a time rather than creating huge lists
   - This keeps memory usage bounded by the chunk size, regardless of how large the IP range is
 3. **Direct Calculation**
   - The nth value of an LCG can be computed directly, in O(log n) steps, by composing the update with itself
   - No need to generate all previous numbers
   - Enables efficient random access to any part of the sequence
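One way to jump to an arbitrary position is to treat a single LCG step as the affine map `x -> a*x + c` and compose that map with itself by repeated squaring. The sketch below is one possible approach under that assumption; `lcg_jump` is a hypothetical name, not a function exported by PyLCG:

```python
def lcg_jump(seed: int, n: int, a: int = 1597, c: int = 51749, m: int = 2**32) -> int:
    '''Return the value reached after n LCG steps from seed, in O(log n) time.'''
    acc_a, acc_c = 1, 0  # identity map: x -> 1*x + 0
    cur_a, cur_c = a, c  # map for 2^k steps, starting with a single step
    while n > 0:
        if n & 1:
            # Apply the current 2^k-step map on top of the accumulated map
            acc_a, acc_c = (cur_a * acc_a) % m, (cur_a * acc_c + cur_c) % m
        # Square the map: 2^k steps become 2^(k+1) steps
        cur_a, cur_c = (cur_a * cur_a) % m, (cur_a * cur_c + cur_c) % m
        n >>= 1
    return (acc_a * seed + acc_c) % m

print(lcg_jump(0, 2))  # 82694902, the same value as stepping twice from seed 0
```

The loop runs once per bit of `n`, so reaching position one million takes about 20 iterations instead of one million.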
 ___
 ## Real-World Applications
 ### Network Security Testing
 Imagine you're testing the security of a large corporate network:
 - You have 5 scanning machines
 - You need to scan 1 million IPs
 - You want to avoid triggering IDS/IPS systems
 PyLCG helps by:
 1. Dividing the IPs evenly across your 5 machines
 2. Randomizing the scan order to reduce the chance of detection
 3. Allowing you to pause/resume scans from any point
 4. Using minimal memory on each machine
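Since the scan order is a pure function of the CIDR, shard number and seed, resuming only requires remembering how many IPs were already processed. The command-line tool in this commit has no checkpoint option, so the sketch below shows one way a caller might add it; `resume_from` and `demo_stream` are hypothetical names, and `demo_stream` is a sequential stand-in for any deterministic IP stream:

```python
import asyncio
import ipaddress

async def demo_stream(cidr: str):
    '''Stand-in for a deterministic IP stream (hypothetical, sequential for brevity).'''
    network = ipaddress.ip_network(cidr)
    first   = int(network.network_address)
    for index in range(network.num_addresses):
        yield str(ipaddress.ip_address(first + index))

async def resume_from(ip_stream, already_done: int):
    '''Skip the first already_done items of a stream, then yield the rest.'''
    position = 0
    async for ip in ip_stream:
        if position >= already_done:
            yield ip
        position += 1

async def collect(ip_stream) -> list:
    '''Drain an async stream into a list.'''
    return [ip async for ip in ip_stream]

# A scan that stopped after 3 IPs picks up at the 4th on restart
remaining = asyncio.run(collect(resume_from(demo_stream('192.0.2.0/29'), 3)))
print(remaining[0])  # 192.0.2.3
```

Skipping by count still walks the earlier positions, so it trades a little start-up time for not having to store any state beyond a single integer.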
 ### Cloud-Based Scanning
 In cloud environments, PyLCG is particularly useful:
 - Easily scale up/down the number of scanning instances
 - Each instance knows exactly which IPs to scan
 - Consistent results across multiple runs
 - Efficient resource usage keeps costs down
 ___
 ###### Mirrors for this repository: [acid.vegas](https://git.acid.vegas/pylcg) • [SuperNETs](https://git.supernets.org/acidvegas/pylcg) • [GitHub](https://github.com/acidvegas/pylcg) • [GitLab](https://gitlab.com/acidvegas/pylcg) • [Codeberg](https://codeberg.org/acidvegas/pylcg)
--- a/pylcg.py
+++ b/pylcg.py
@@ -0,0 +1,134 @@
 #!/usr/bin/env python3
 # Python implementation of a Linear Congruential Generator for IP Sharding - Developed by acidvegas in Python (https://git.acid.vegas/pylcg)
 # pylcg.py
 import argparse
 import asyncio
 import ipaddress
 from math import ceil
 class LCG:
    '''Linear Congruential Generator for deterministic random number generation'''
    def __init__(self, seed: int, m: int = 2**32):
        self.m       = m
        self.a       = 1597
        self.c       = 51749
        self.seed    = seed
        self.current = seed
    def get_nth(self, n: int) -> int:
        '''
        Get the nth number in the sequence without generating previous numbers.
        :param n: The index of the number to get
        '''
        if n <= 0:
            return self.seed
        # Compose the update x -> a*x + c with itself by repeated squaring, so the
        # nth value costs O(log n) multiplications and leaves self.current untouched
        acc_a, acc_c = 1, 0
        cur_a, cur_c = self.a, self.c
        while n > 0:
            if n & 1:
                acc_a, acc_c = (cur_a * acc_a) % self.m, (cur_a * acc_c + cur_c) % self.m
            cur_a, cur_c = (cur_a * cur_a) % self.m, (cur_a * cur_c + cur_c) % self.m
            n >>= 1
        return (acc_a * self.seed + acc_c) % self.m
    def next(self) -> int:
        '''Generate next random number'''
        self.current = (self.a * self.current + self.c) % self.m
        return self.current
 class IPRange:
    '''Memory-efficient IP range iterator'''
    def __init__(self, cidr: str):
        network    = ipaddress.ip_network(cidr)
        self.start = int(network.network_address)
        self.end   = int(network.broadcast_address)
        self.total = self.end - self.start + 1
    def get_ip_at_index(self, index: int) -> str:
        '''
        Get IP at specific index without generating previous IPs
        :param index: The index of the IP to get
        '''
        if not 0 <= index < self.total:
            raise IndexError('IP index out of range')
        return str(ipaddress.ip_address(self.start + index))
 async def get_shard_ips(cidr: str, shard_num: int, total_shards: int, seed: int, chunk_size: int = 1000):
    '''
    Asynchronously generate IPs for the specified shard.
    :param cidr: The CIDR range to shard
    :param shard_num: The number of the shard to generate
    :param total_shards: The total number of shards
    :param seed: The seed for the random number generator
    :param chunk_size: The size of the chunks to process
    '''
    # Initialize the IP range and LCG
    ip_range  = IPRange(cidr)
    lcg       = LCG(seed)
    total_ips = ip_range.total
    # Calculate which indices belong to this shard
    shard_size = ceil(total_ips / total_shards)
    start_idx  = shard_num * shard_size
    end_idx    = min(start_idx + shard_size, total_ips)
    # Process in chunks to maintain memory efficiency
    for chunk_start in range(start_idx, end_idx, chunk_size):
        chunk_end     = min(chunk_start + chunk_size, end_idx)
        chunk_indices = list(range(chunk_start, chunk_end))
        # Generate random values for this chunk
        chunk_random_values = [(i, lcg.get_nth(i)) for i in chunk_indices]
        chunk_random_values.sort(key=lambda x: x[1])
        # Yield IPs in randomized order
        for idx, _ in chunk_random_values:
            yield ip_range.get_ip_at_index(idx)
            # Yield control to the event loop so other tasks can run between IPs
            await asyncio.sleep(0)
 async def main():
    parser = argparse.ArgumentParser(description='Async IP address sharding tool')
    parser.add_argument('cidr', help='Target IP range in CIDR format')
    parser.add_argument('shard_num', type=int, help='Shard number (0-based)')
    parser.add_argument('total_shards', type=int, help='Total number of shards')
    parser.add_argument('--seed', type=int, default=12345, help='Random seed for LCG')
    parser.add_argument('--chunk-size', type=int, default=1000, help='Processing chunk size')
    args = parser.parse_args()
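    # Validate arguments up front and report problems as usage errors
    if args.total_shards < 1:
        parser.error('Total shards must be at least 1')
    if not 0 <= args.shard_num < args.total_shards:
        parser.error('Shard number must be between 0 and total shards - 1')
    if args.chunk_size < 1:
        parser.error('Chunk size must be at least 1')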
    async for ip in get_shard_ips(args.cidr, args.shard_num, args.total_shards, args.seed, args.chunk_size):
        print(ip)
 if __name__ == '__main__':
    asyncio.run(main()) 
--- a/unit_test.py
+++ b/unit_test.py
@@ -0,0 +1,172 @@
 #!/usr/bin/env python3
 # Python implementation of a Linear Congruential Generator for IP Sharding - Developed by acidvegas in Python (https://git.acid.vegas/pylcg)
 # unit_test.py
 import unittest
 import asyncio
 import ipaddress
 import sys
 import time
 from pylcg import IPRange, get_shard_ips, LCG
 # ANSI color codes
 class Colors:
    BLUE   = '\033[94m'
    GREEN  = '\033[92m'
    YELLOW = '\033[93m'
    CYAN   = '\033[96m'
    RED    = '\033[91m'
    ENDC   = '\033[0m'
 def progress_bar(iteration: int, total: int, prefix: str = '', length: int = 50) -> None:
    '''Simple progress bar using standard Python'''
    percent = f"{100 * (iteration / float(total)):.1f}"
    filled_length = int(length * iteration // total)
    bar = '█' * filled_length + '-' * (length - filled_length)
    sys.stdout.write(f'\r{Colors.CYAN}{prefix} |{bar}| {percent}%{Colors.ENDC} ')
    if iteration == total:
        sys.stdout.write('\n')
    sys.stdout.flush()
 def print_header(message: str) -> None:
    '''Print formatted header'''
    print(f'\n{Colors.BLUE}{"="*80}')
    print(f'TEST: {message}')
    print(f'{"="*80}{Colors.ENDC}\n')
 def print_success(message: str) -> None:
    '''Print success message'''
    print(f'{Colors.GREEN}✓ {message}{Colors.ENDC}')
 def print_progress(message: str) -> None:
    '''Print progress message'''
    print(f"{Colors.YELLOW}⟳ {message}{Colors.ENDC}")
 class TestIPSharder(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        '''Set up test parameters'''
        print_header('Setting up test environment')
        cls.test_cidr = '192.0.0.0/16'  # 65,536 IPs
        cls.test_seed = 12345
        cls.total_shards = 4
        cls.chunk_size = 1000
        # Calculate expected IPs
        network = ipaddress.ip_network(cls.test_cidr)
        cls.all_ips = {str(ip) for ip in network}
        print_success(f"Initialized test environment with {len(cls.all_ips):,} IPs")
    def setUp(self):
        '''Create event loop for each test'''
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)
    def tearDown(self):
        '''Clean up event loop'''
        self.loop.close()
    async def collect_shard_ips(self, shard_num: int):
        '''Helper to collect IPs from a shard'''
        return {ip async for ip in get_shard_ips(self.test_cidr, shard_num, self.total_shards, self.test_seed, self.chunk_size)}
    def test_ip_range_initialization(self):
        '''Test IPRange class initialization and calculations'''
        print_header('Testing IPRange initialization')
        ip_range = IPRange(self.test_cidr)
        self.assertEqual(ip_range.total, 65536)
        print_success('IP range size correctly calculated')
        first_ip = ip_range.get_ip_at_index(0)
        last_ip = ip_range.get_ip_at_index(ip_range.total - 1)
        self.assertEqual(first_ip, '192.0.0.0')
        self.assertEqual(last_ip, '192.0.255.255')
        print_success(f'IP range spans from {first_ip} to {last_ip}')
    def test_shard_completeness(self):
        '''Test that all IPs are covered exactly once across all shards'''
        print_header('Testing shard completeness')
        async def check_completeness():
            seen_ips = set()
            shard_sizes = []
            for shard_num in range(self.total_shards):
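                # Collect into a list first: a set would silently hide duplicate IPs
                shard_list = [ip async for ip in get_shard_ips(self.test_cidr, shard_num, self.total_shards, self.test_seed, self.chunk_size)]
                shard_ips  = set(shard_list)
                shard_sizes.append(len(shard_list))
                progress_bar(shard_num + 1, self.total_shards, prefix='Processing shards')
                # Check for duplicates within this shard
                self.assertEqual(len(shard_list), len(shard_ips),
                               f'Duplicates found in shard {shard_num}')
                # Check for overlap with previously seen shards
                overlap = seen_ips & shard_ips
                self.assertEqual(len(overlap), 0,
                               f'Overlap found with previous shards: {overlap}')
                seen_ips.update(shard_ips)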
            # Verify all IPs are covered
            self.assertEqual(seen_ips, self.all_ips,
                           'Not all IPs were covered by the shards')
            print_success(f'All {len(self.all_ips):,} IPs were distributed across shards')
            # Print distribution information
            for i, size in enumerate(shard_sizes):
                print(f"{Colors.CYAN}Shard {i}: {size:,} IPs{Colors.ENDC}")
        self.loop.run_until_complete(check_completeness())
    def test_lcg_sequence(self):
        '''Test LCG sequence generation and performance'''
        print_header('Testing LCG sequence generation')
        lcg = LCG(seed=self.test_seed)
        # Test small sequence
        small_n      = 100
        start_time   = time.perf_counter()
        small_result = lcg.get_nth(small_n)
        small_time   = time.perf_counter() - start_time
        print_success(f'Small sequence (n={small_n:,}) generated in {small_time:.6f}s')
        # Test large sequence
        large_n      = 1_000_000
        start_time   = time.perf_counter()
        large_result = lcg.get_nth(large_n)
        large_time   = time.perf_counter() - start_time
        print_success(f'Large sequence (n={large_n:,}) generated in {large_time:.6f}s')
        # Verify deterministic behavior
        lcg2 = LCG(seed=self.test_seed)
        print_progress('Verifying sequence determinism...')
        for i in range(large_n):
            if i % (large_n // 100) == 0: # Update progress every 1%
                progress_bar(i, large_n, prefix='Verifying sequence')
            lcg2.next()
        progress_bar(large_n, large_n, prefix='Verifying sequence')
        self.assertEqual(large_result, lcg2.current, 'LCG sequence is not deterministic')
        print_success('LCG produces consistent results')
 if __name__ == '__main__':
    print(f"\n{Colors.CYAN}{'='*80}")
    print(f"Starting IP Sharder Tests - Testing with {65536:,} IPs (/16 network)")
    print(f"{'='*80}{Colors.ENDC}\n")
    unittest.main(verbosity=2)