
HTTPZ Web Scanner

A high-performance concurrent web scanner written in Python. HTTPZ efficiently scans domains for HTTP/HTTPS services, extracting status codes, page titles, SSL certificate details, response headers, and more.

Requirements

Python 3 and the packages listed in requirements.txt (installed automatically when installing from PyPI).

Installation

# Install from PyPI
pip install httpz_scanner

# The 'httpz' command will now be available in your terminal
httpz --help

From source

# Clone the repository
git clone https://github.com/acidvegas/httpz
cd httpz
pip install -r requirements.txt
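
From the cloned repository, the scanner can also be run directly as a module:

# Verify the local checkout works
python -m httpz_scanner --help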

Usage

Command Line Interface

Basic usage:

python -m httpz_scanner domains.txt

Scan with all flags enabled and output to JSONL:

python -m httpz_scanner domains.txt -all -c 100 -o results.jsonl -j -p

Read from stdin:

cat domains.txt | python -m httpz_scanner - -all -c 100
echo "example.com" | python -m httpz_scanner - -all

Filter by status codes and follow redirects:

python -m httpz_scanner domains.txt -mc 200,301-399 -ec 404,500 -fr -p

Show specific fields with custom timeout and resolvers:

python -m httpz_scanner domains.txt -sc -ti -i -tls -to 10 -r resolvers.txt

Full scan with all options:

python -m httpz_scanner domains.txt -c 100 -o output.jsonl -j -all -to 10 -mc 200,301 -ec 404,500 -p -ax -r resolvers.txt

Distributed Scanning

Split scanning across multiple machines using the --shard argument:

# Machine 1
httpz domains.txt --shard 1/3

# Machine 2
httpz domains.txt --shard 2/3

# Machine 3
httpz domains.txt --shard 3/3

Each machine will process a different subset of domains without overlap. For example, with 3 shards:

  • Machine 1 processes lines 0,3,6,9,...
  • Machine 2 processes lines 1,4,7,10,...
  • Machine 3 processes lines 2,5,8,11,...

This allows efficient distribution of large scans across multiple machines.
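
The assignment is a simple round-robin over input line numbers. A minimal sketch of the selection logic (in_shard is a hypothetical helper for illustration; it assumes the 1-based shard index from the CLI, and the scanner's internals may differ):

def in_shard(line_index: int, shard_index: int, total_shards: int) -> bool:
    # Shard 1 of 3 takes lines 0, 3, 6, ...; shard 2 takes 1, 4, 7, ...
    # shard_index is the 1-based N from "--shard N/T"
    return line_index % total_shards == shard_index - 1

# Shard 2 of 3 would process lines 1, 4, 7, 10, ...
assert [i for i in range(12) if in_shard(i, 2, 3)] == [1, 4, 7, 10]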

Python Library

import asyncio
import aiohttp
import aioboto3
from httpz_scanner import HTTPZScanner

async def scan_domains():
    # Initialize scanner with all possible options (showing defaults)
    scanner = HTTPZScanner(
        # Core settings
        concurrent_limit=100,   # Number of concurrent requests
        timeout=5,              # Request timeout in seconds
        follow_redirects=False, # Follow redirects (max 10)
        check_axfr=False,       # Try AXFR transfer against nameservers
        resolver_file=None,     # Path to custom DNS resolvers file
        output_file=None,       # Path to JSONL output file
        show_progress=False,    # Show progress counter
        debug_mode=False,       # Show error states and debug info
        jsonl_output=False,     # Output in JSONL format
        shard=None,             # Tuple of (shard_index, total_shards) for distributed scanning
        
        # Control which fields to show (all False by default unless show_fields is None)
        show_fields={
            'status_code': True,      # Show status code
            'content_type': True,     # Show content type
            'content_length': True,   # Show content length
            'title': True,            # Show page title
            'body': True,             # Show body preview
            'ip': True,               # Show IP addresses
            'favicon': True,          # Show favicon hash
            'headers': True,          # Show response headers
            'follow_redirects': True, # Show redirect chain
            'cname': True,            # Show CNAME records
            'tls': True               # Show TLS certificate info
        },
        
        # Filter results
        match_codes={200,301,302},  # Only show these status codes
        exclude_codes={404,500,503} # Exclude these status codes
    )

    # Initialize resolvers (required before scanning)
    await scanner.init()

    # Example 1: Stream from S3/MinIO using aioboto3
    async with aioboto3.Session().client('s3', 
            endpoint_url='http://minio.example.com:9000',
            aws_access_key_id='access_key',
            aws_secret_access_key='secret_key') as s3:
        
        response = await s3.get_object(Bucket='my-bucket', Key='huge-domains.txt')
        async with response['Body'] as stream:
            async def s3_generator():
                while True:
                    line = await stream.readline()
                    if not line:
                        break
                    yield line.decode().strip()
            
            await scanner.scan(s3_generator())

    # Example 2: Stream from URL using aiohttp
    async with aiohttp.ClientSession() as session:
        # For large files - stream line by line
        async with session.get('https://example.com/huge-domains.txt') as resp:
            async def url_generator():
                async for line in resp.content:
                    yield line.decode().strip()
            
            await scanner.scan(url_generator())
        
        # For small files - read all at once
        async with session.get('https://example.com/small-domains.txt') as resp:
            content = await resp.text()
            await scanner.scan(content)  # Library handles splitting into lines

    # Example 3: Simple list of domains
    domains = [
        'example1.com',
        'example2.com',
        'example3.com'
    ]
    await scanner.scan(domains)

if __name__ == '__main__':
    asyncio.run(scan_domains())

The scanner accepts various input types:

  • Async/sync generators that yield domains
  • String content with newlines
  • Lists/tuples of domains
  • File paths
  • stdin (using '-')

All inputs support sharding for distributed scanning.
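
For example, the same scan() call accepts a file path, a list, or raw string content, and the shard tuple restricts any of them to a subset. A brief sketch using the constructor options shown above (the 1-based shard index mirrors the CLI form and is an assumption here):

import asyncio
from httpz_scanner import HTTPZScanner

async def main():
    # Shard 1 of 3: this process only scans every third domain
    scanner = HTTPZScanner(concurrent_limit=50, shard=(1, 3))
    await scanner.init()  # resolvers must be initialized before scanning

    await scanner.scan('domains.txt')                     # file path
    await scanner.scan(['example1.com', 'example2.com'])  # list of domains
    await scanner.scan('example1.com\nexample2.com')      # newline-separated string

if __name__ == '__main__':
    asyncio.run(main())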

Arguments

Argument   Long Form        Description
file                        File containing domains (one per line); use - for stdin
-d         --debug          Show error states and debug information
-c N       --concurrent N   Number of concurrent checks (default: 100)
-o FILE    --output FILE    Output file path (JSONL format)
-j         --jsonl          Output JSON Lines format to console
-all       --all-flags      Enable all output flags
-sh N/T    --shard N/T      Process shard N of T total shards (e.g., 1/3)

Output Field Flags

Flag   Long Form            Description
-sc    --status-code        Show status code
-ct    --content-type       Show content type
-ti    --title              Show page title
-b     --body               Show body preview
-i     --ip                 Show IP addresses
-f     --favicon            Show favicon hash
-hr    --headers            Show response headers
-cl    --content-length     Show content length
-fr    --follow-redirects   Follow redirects (max 10)
-cn    --cname              Show CNAME records
-tls   --tls-info           Show TLS certificate information

Other Options

Option      Long Form               Description
-to N       --timeout N             Request timeout in seconds (default: 5)
-mc CODES   --match-codes CODES     Only show these status codes (comma-separated; ranges like 301-399 supported)
-ec CODES   --exclude-codes CODES   Exclude these status codes (comma-separated)
-p          --progress              Show progress counter
-ax         --axfr                  Try AXFR transfer against nameservers
-r FILE     --resolvers FILE        File containing DNS resolvers (one per line)
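
Since -o writes JSON Lines, results can be post-processed one object at a time. A minimal sketch (the filename matches the earlier examples; inspect your own output for the actual field names):

import json

with open('results.jsonl') as fh:
    for line in fh:
        result = json.loads(line)  # one JSON object per scanned domain
        print(result)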