139 lines
4.7 KiB
Markdown
139 lines
4.7 KiB
Markdown
# Rules for handling Data Leaks
|
|
|
|
This normalization framework is designed to standardize data leaks for
|
|
[Eleakxir](https://github.com/anotherhadi/eleakxir), the open-source search
|
|
engine, using
|
|
[leak-utils](https://github.com/anotherhadi/eleakxir/blob/main/leak-utils/README.md),
|
|
a dedicated CLI tool that converts and cleans files for efficient indexing and
|
|
searching.
|
|
|
|
## The Relevance of Parquet for Data Leaks
|
|
|
|
Parquet is an efficient, open-source columnar storage file format designed to
|
|
handle complex data in bulk. When dealing with data leaks, its choice is highly
|
|
relevant for several reasons:
|
|
|
|
- **Compression**: Parquet files offer superior compression compared to
|
|
row-based formats like CSV. By storing data column by column, it applies more
|
|
effective compression algorithms, which significantly reduces disk space. For
|
|
data leaks, where file sizes can range from gigabytes to terabytes, this is
|
|
crucial for minimizing storage costs.
|
|
- **Query Performance**: As a columnar format, Parquet allows you to read only
|
|
the specific columns you need for a query. In a data leak, you might only be
|
|
interested in emails and passwords, not full addresses or phone numbers. This
|
|
selective reading drastically speeds up search operations, as the system
|
|
doesn't have to scan through entire rows of irrelevant data.
|
|
- **Efficiency**: The format is optimized for analytics. It stores data with
|
|
metadata and statistics (min/max values) for each column, allowing for query
|
|
**pruning**. This means a query can skip entire blocks of data that don't
|
|
match the filtering criteria, boosting performance even further.
|
|
|
|
## Disclaimer
|
|
|
|
The information in this document is provided **for research and educational
|
|
purposes only**. I am **not responsible** for how this data, methods, or
|
|
guidelines are used. Any misuse, unlawful activity, or harm resulting from
|
|
applying this content is the sole responsibility of the individual or
|
|
organization using it.
|
|
|
|
## File Naming Convention
|
|
|
|
- **Lowercase only**, ASCII (no accents).
|
|
- **Separators**:
|
|
- `_` inside blocks (`date_2023_10`)
|
|
- `-` between blocks (`instagram.com-date_2023_10`)
|
|
- **Prefix**: always start with the **source name/url** (e.g., `instagram.com`,
|
|
`alien_txt`).
|
|
- **Blocks**: each additional part must be prefixed by its block name:
|
|
|
|
- `date_YYYY[_MM[_DD]]` → use ISO format (year, or year-month, or full date).
|
|
- `source_*` → origin of the leak (e.g., `scrape`, `dump`, `combo`).
|
|
- `version_v*` → versioning if regenerated or transformed.
|
|
- `notes_*` → optional clarifications.
|
|
- **Extension**: always `.parquet`.
|
|
|
|
**Recommended pattern:**
|
|
|
|
```txt
|
|
{source}-date_{YYYY[_MM[_DD]]}-source_{origin}-version_{vN}-notes_{info}.parquet
|
|
```
|
|
|
|
**Examples:**
|
|
|
|
```txt
|
|
instagram.com-date_2023_10.parquet
|
|
alien_txt-date_2022-source_dump.parquet
|
|
combo_french-notes_crypto.parquet
|
|
```
|
|
|
|
## Column Naming Convention
|
|
|
|
- **snake\_case only** (lowercase, `_` separator).
|
|
|
|
- **No dots (`.`)** in column names (`husband.phone` → `husband_phone`).
|
|
|
|
- **Allowed characters**: `[a-z0-9_]+` (no spaces, hyphens, or accents).
|
|
|
|
- **Multiple variants of the same field**:
|
|
|
|
- Relations → prefix clearly: `husband_phone`, `mother_last_name`.
|
|
- Multiples of the same type → numbered prefix: `1_phone`, `2_phone`,
|
|
`3_phone`.
|
|
- Always end with the column "type" (e.g., `_phone`, `_last_name`).
|
|
|
|
- **Rename if mislabeled**: If a `username` column actually contains only emails
|
|
rename it to `email`.
|
|
|
|
- **Remove irrelevant columns**: Drop meaningless identifiers like `id` or
|
|
fields with no analytical value.
|
|
|
|
- **Standard columns**: to enable schema alignment across leaks:
|
|
|
|
| Column |
|
|
| ------------- |
|
|
| email |
|
|
| username |
|
|
| password |
|
|
| password_hash |
|
|
| phone |
|
|
| date |
|
|
| birth_date |
|
|
| age |
|
|
| first_name |
|
|
| last_name |
|
|
| full_name |
|
|
| address |
|
|
| city |
|
|
| country |
|
|
| state |
|
|
| postal_code |
|
|
| ip |
|
|
| url |
|
|
| city |
|
|
|
|
## Standard Column Formatting
|
|
|
|
- **Email**: lowercase, trimmed, keep only `[^a-z0-9._@-]`.
|
|
|
|
- **Phone**: keep only `[^0-9]`
|
|
|
|
- **Names**:
|
|
|
|
- Keep `first_name` / `last_name` if present.
|
|
- Generate `full_name = CONCAT(first_name, ' ', last_name)`.
|
|
- If only `name` exists, rename it to `full_name`.
|
|
|
|
- **Passwords**:
|
|
|
|
- Hashes → `password_hash`.
|
|
- Plaintext → `password`.
|
|
- Never mix hashes and plaintext in the same column.
|
|
|
|
- **NULLs**: always use SQL `NULL` (never `""` or `"NULL"`).
|
|
|
|
## Deduplication
|
|
|
|
Deduplication is often **impractical at scale** (billions of rows). Do **not**
|
|
attempt to deduplicate at ingestion time. Instead, handle deduplication **after
|
|
running a search** to optimize performance and storage.
|