# Elasticsearch Recon Ingestion Scripts (ERIS)
> A utility for ingesting large-scale reconnaissance data logs into Elasticsearch

This is a suite of tools to aid in the ingestion of recon data from various sources *(httpx, masscan, zonefiles, etc.)* into an [Elasticsearch](https://www.elastic.co/elasticsearch) cluster. The entire codebase is designed around asynchronous processing, as well as load balancing ingestion across all of the nodes in your cluster. Additionally, live data ingestion is supported for many of the sources, which means data can be processed and ingested into your Elasticsearch cluster as it is produced. The structure allows for the development of "modules" or "plugins", if you will, to quickly create custom ingestion helpers for anything!

## Prerequisites
- [python](https://www.python.org/)
    - [elasticsearch](https://pypi.org/project/elasticsearch/) *(`pip install elasticsearch`)*
    - [aiofiles](https://pypi.org/project/aiofiles) *(`pip install aiofiles`)*
    - [aiohttp](https://pypi.org/project/aiohttp) *(`pip install aiohttp`)*

## Usage
```shell
python eris.py [options] <input>
```
**Note:** The `<input>` can be a file or a directory of files, depending on the ingestion script.

### Options
###### General arguments
| Argument     | Description                                   |
|--------------|-----------------------------------------------|
| `input_path` | Path to the input file or directory           |
| `--watch`    | Create or watch a FIFO for real-time indexing |

###### Elasticsearch arguments
| Argument        | Description                                             | Default             |
|-----------------|---------------------------------------------------------|---------------------|
| `--host`        | Elasticsearch host                                      | `http://localhost/` |
| `--port`        | Elasticsearch port                                      | `9200`              |
| `--user`        | Elasticsearch username                                  | `elastic`           |
| `--password`    | Elasticsearch password                                  | `$ES_PASSWORD`      |
| `--api-key`     | Elasticsearch API Key for authentication                | `$ES_APIKEY`        |
| `--self-signed` | Elasticsearch connection with a self-signed certificate |                     |

###### Elasticsearch indexing arguments
| Argument     | Description                          | Default             |
|--------------|--------------------------------------|---------------------|
| `--index`    | Elasticsearch index name             | Depends on ingestor |
| `--pipeline` | Use an ingest pipeline for the index |                     |
| `--replicas` | Number of replicas for the index     | `1`                 |
| `--shards`   | Number of shards for the index       | `1`                 |

###### Performance arguments
| Argument       | Description                                              | Default |
|----------------|----------------------------------------------------------|---------|
| `--chunk-max`  | Maximum size in MB of a chunk                            | `100`   |
| `--chunk-size` | Number of records to index in a chunk                    | `50000` |
| `--retries`    | Number of times to retry indexing a chunk before failing | `100`   |
| `--timeout`    | Number of seconds to wait before retrying a chunk        | `60`    |
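
`--chunk-size` and `--chunk-max` bound a chunk in two ways: by record count and by size. A minimal sketch of that batching logic follows, assuming records are JSON-serializable dicts and that size is measured on the serialized record. The function name `chunk_records` is hypothetical and this is not the code ERIS itself runs.

```python
import json
from typing import Iterable, Iterator, List


def chunk_records(
    records: Iterable[dict],
    chunk_size: int = 50000,
    chunk_max_mb: int = 100,
) -> Iterator[List[dict]]:
    """Group records into chunks capped by record count and by serialized size."""
    max_bytes = chunk_max_mb * 1024 * 1024
    chunk: List[dict] = []
    chunk_bytes = 0
    for record in records:
        record_bytes = len(json.dumps(record).encode("utf-8"))
        # Flush before adding a record that would break either limit
        if chunk and (len(chunk) >= chunk_size or chunk_bytes + record_bytes > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(record)
        chunk_bytes += record_bytes
    if chunk:
        yield chunk
```

Whichever limit is hit first closes the chunk, so small records are bounded by count and large records by size. A single record larger than the size cap is still emitted in a chunk of its own.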

###### Ingestion arguments
| Argument    | Description              |
|-------------|--------------------------|
| `--certs`   | Index Certstream records |
| `--httpx`   | Index HTTPX records      |
| `--masscan` | Index Masscan records    |
| `--massdns` | Index massdns records    |
| `--zone`    | Index zone DNS records   |

This ingestion suite uses the built-in node sniffer, so by connecting to a single node, you can load balance across the entire cluster.
It is good to know how many nodes you have in the cluster so you can fine-tune the arguments for the best performance in your environment.
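
As a purely conceptual illustration of what load balancing across sniffed nodes means, the sketch below hands chunks to nodes in round-robin order. The node URLs and the function name `assign_chunks` are hypothetical; the Elasticsearch client performs node selection internally, so this is not code you need to write yourself.

```python
from itertools import cycle
from typing import Iterable, Iterator, List, Tuple


def assign_chunks(nodes: List[str], chunks: Iterable[list]) -> Iterator[Tuple[str, list]]:
    """Pair each chunk with a node, rotating through the node list in order."""
    if not nodes:
        raise ValueError("at least one node is required")
    node_cycle = cycle(nodes)
    for chunk in chunks:
        yield next(node_cycle), chunk
```

With three nodes, every third chunk lands on the same node, which is why the node count matters when choosing chunk sizes.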

## GeoIP Pipeline
Create & add a geoip pipeline and use the following in your index mappings:

```json
"geoip": {
    "city_name": "City",
    "continent_name": "Continent",
    "country_iso_code": "CC",
    "country_name": "Country",
    "location": {
        "lat": 0.0000,
        "lon": 0.0000
    },
    "region_iso_code": "RR",
    "region_name": "Region"
}
```
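
For reference, here is a sketch of an ingest pipeline body and a matching mapping that would produce and store the structure above. The source field name `ip`, the description text, and the choice of `keyword` for the text fields are assumptions; adjust them to your records. The `geoip` processor options `field` and `target_field` and the `geo_point` type are standard Elasticsearch features.

```python
import json

# Ingest pipeline body: the geoip processor reads an IP address from `field`
# and writes the lookup result under `target_field`.
GEOIP_PIPELINE = {
    "description": "Add GeoIP data to records (illustrative)",
    "processors": [
        {"geoip": {"field": "ip", "target_field": "geoip"}}  # "ip" is an assumed field name
    ],
}

# Mapping fragment for the "geoip" object; `location` uses geo_point so that
# the lat/lon pair can be used in geo queries and map visualizations.
GEOIP_MAPPING = {
    "geoip": {
        "properties": {
            "city_name": {"type": "keyword"},
            "continent_name": {"type": "keyword"},
            "country_iso_code": {"type": "keyword"},
            "country_name": {"type": "keyword"},
            "location": {"type": "geo_point"},
            "region_iso_code": {"type": "keyword"},
            "region_name": {"type": "keyword"},
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(GEOIP_PIPELINE, indent=4))
    print(json.dumps(GEOIP_MAPPING, indent=4))
```

The pipeline body can be registered with `PUT _ingest/pipeline/<name>`, and that name is then what you would pass to `--pipeline`.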

## Changelog
- Added ingestion script for certificate transparency logs in real time using websockets.
- `--dry-run` removed as this nears production level
- Implemented [async elasticsearch](https://elasticsearch-py.readthedocs.io/en/latest/async.html) into the codebase & refactored some of the logic to accommodate.
- The `--watch` feature now uses a FIFO to do live ingestion.
- Isolated eris.py into its own file and separated the ingestion agents into their own modules.

## Roadmap
- Fix the issue where `ingest_certs.py` requires a file argument even though it does not need one.
- Create a module for RIR database ingestion *(WHOIS, delegations, transfer, ASN mapping, peering, etc)*
- Dynamically update the batch metrics when the sniffer adds or removes nodes.

___

###### Mirrors for this repository: [acid.vegas](https://git.acid.vegas/eris) • [SuperNETs](https://git.supernets.org/acidvegas/eris) • [GitHub](https://github.com/acidvegas/eris) • [GitLab](https://gitlab.com/acidvegas/eris) • [Codeberg](https://codeberg.org/acidvegas/eris)