init

leak-utils/DATALEAKS-NORMALIZATION.md (new file, 138 lines)
# Rules for handling Data Leaks

This normalization framework is designed to standardize data leaks for
[Eleakxir](https://github.com/anotherhadi/eleakxir), the open-source search
engine, using
[leak-utils](https://github.com/anotherhadi/eleakxir-temp/blob/main/leak-utils/README.md),
a dedicated CLI tool that converts and cleans files for efficient indexing and
searching.
## The Relevance of Parquet for Data Leaks

Parquet is an efficient, open-source columnar storage file format designed to
handle complex data in bulk. When dealing with data leaks, it is a highly
relevant choice for several reasons:

- **Compression**: Parquet files offer superior compression compared to
  row-based formats like CSV. By storing data column by column, Parquet can
  apply more effective compression algorithms, which significantly reduces disk
  usage. For data leaks, where file sizes can range from gigabytes to
  terabytes, this is crucial for minimizing storage costs.
- **Query Performance**: As a columnar format, Parquet allows you to read only
  the specific columns you need for a query. In a data leak, you might only be
  interested in emails and passwords, not full addresses or phone numbers. This
  selective reading drastically speeds up search operations, as the system
  doesn't have to scan through entire rows of irrelevant data.
- **Efficiency**: The format is optimized for analytics. It stores metadata and
  statistics (min/max values) for each column, allowing for query **pruning**:
  a query can skip entire blocks of data that don't match the filtering
  criteria, boosting performance even further.
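The pruning idea above can be sketched in Go. This is an illustration only, not DuckDB's or Parquet's actual reader; the `rowGroup` type and its min/max fields are invented for the example, mimicking the per-column statistics a Parquet footer stores:

```go
package main

import "fmt"

// rowGroup mimics the per-column min/max statistics that Parquet keeps
// in its file metadata (illustrative only).
type rowGroup struct {
	minEmail, maxEmail string
	rows               []string
}

// groupsToScan returns only the row groups whose [min, max] range could
// contain the target value; the other groups are skipped without ever
// being read from disk.
func groupsToScan(groups []rowGroup, target string) []rowGroup {
	var keep []rowGroup
	for _, g := range groups {
		if target >= g.minEmail && target <= g.maxEmail {
			keep = append(keep, g)
		}
	}
	return keep
}

func main() {
	groups := []rowGroup{
		{minEmail: "a@x.com", maxEmail: "f@x.com", rows: []string{"bob@x.com"}},
		{minEmail: "g@x.com", maxEmail: "m@x.com", rows: []string{"hal@x.com"}},
		{minEmail: "n@x.com", maxEmail: "z@x.com", rows: []string{"zoe@x.com"}},
	}
	// Only the middle group's range can contain the target.
	fmt.Println(len(groupsToScan(groups, "hal@x.com"))) // prints 1
}
```

Real readers do this per row group and per column, but the effect is the same: most of a terabyte-scale file is never touched for a selective query.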
## Disclaimer

The information in this document is provided **for research and educational
purposes only**. I am **not responsible** for how this data, methods, or
guidelines are used. Any misuse, unlawful activity, or harm resulting from
applying this content is the sole responsibility of the individual or
organization using it.
## File Naming Convention

- **Lowercase only**, ASCII (no accents).
- **Separators**:
  - `_` inside blocks (`date_2023_10`)
  - `-` between blocks (`instagram.com-date_2023_10`)
- **Prefix**: always start with the **source name/url** (e.g., `instagram.com`,
  `alien_txt`).
- **Blocks**: each additional part must be prefixed by its block name:
  - `date_YYYY[_MM[_DD]]` → use ISO format (year, or year-month, or full date).
  - `source_*` → origin of the leak (e.g., `scrape`, `dump`, `combo`).
  - `version_v*` → versioning if regenerated or transformed.
  - `notes_*` → optional clarifications.
- **Extension**: always `.parquet`.

**Recommended pattern:**

```txt
{source}-date_{YYYY[_MM[_DD]]}-source_{origin}-version_{vN}-notes_{info}.parquet
```

**Examples:**

```txt
instagram.com-date_2023_10.parquet
alien_txt-date_2022-source_dump.parquet
combo_french-notes_crypto.parquet
```
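As an illustration, the recommended pattern can be encoded as a regular expression and checked against the examples. The regex below is an assumption derived from the rules above (every block after the source treated as optional), not something shipped with leak-utils:

```go
package main

import (
	"fmt"
	"regexp"
)

// filenameRe is one possible encoding of the recommended pattern:
// {source}-date_{...}-source_{...}-version_{vN}-notes_{...}.parquet
var filenameRe = regexp.MustCompile(
	`^[a-z0-9._]+` + // source name/url, lowercase ASCII
		`(-date_[0-9]{4}(_[0-9]{2}){0,2})?` + // date_YYYY[_MM[_DD]]
		`(-source_[a-z0-9_]+)?` + // origin of the leak
		`(-version_v[0-9]+)?` + // regeneration version
		`(-notes_[a-z0-9_]+)?` + // optional clarifications
		`\.parquet$`)

func main() {
	for _, name := range []string{
		"instagram.com-date_2023_10.parquet",
		"alien_txt-date_2022-source_dump.parquet",
		"combo_french-notes_crypto.parquet",
		"Instagram-Date-2023.parquet", // uppercase and wrong separators: rejected
	} {
		fmt.Println(name, filenameRe.MatchString(name))
	}
}
```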
## Column Naming Convention

- **snake_case only** (lowercase, `_` separator).
- **No dots (`.`)** in column names (`husband.phone` → `husband_phone`).
- **Allowed characters**: `[a-z0-9_]+` (no spaces, hyphens, or accents).
- **Multiple variants of the same field**:
  - Relations → prefix clearly: `husband_phone`, `mother_last_name`.
  - Multiples of the same type → numbered prefix: `1_phone`, `2_phone`,
    `3_phone`.
  - Always end with the column "type" (e.g., `_phone`, `_last_name`).
- **Rename if mislabeled**: if a `username` column actually contains only
  emails, rename it to `email`.
- **Remove irrelevant columns**: drop meaningless identifiers like `id` or
  fields with no analytical value.
- **Standard columns**, to enable schema alignment across leaks:

  | Column        |
  | ------------- |
  | email         |
  | username      |
  | password      |
  | password_hash |
  | phone         |
  | date          |
  | birth_date    |
  | age           |
  | first_name    |
  | last_name     |
  | full_name     |
  | address       |
  | city          |
  | country       |
  | state         |
  | postal_code   |
  | ip            |
  | url           |
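A minimal sketch of a normalizer implementing the rules above, assuming accented and other disallowed characters are simply dropped rather than transliterated (the `normalizeColumn` helper is invented for this example, not part of leak-utils):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var invalidChars = regexp.MustCompile(`[^a-z0-9_]+`)

// normalizeColumn applies the naming rules: lowercase, dots/hyphens/spaces
// become underscores, and anything left outside [a-z0-9_] is dropped
// (so accented characters are removed, not transliterated).
func normalizeColumn(name string) string {
	name = strings.ToLower(strings.TrimSpace(name))
	name = strings.NewReplacer(".", "_", "-", "_", " ", "_").Replace(name)
	name = invalidChars.ReplaceAllString(name, "")
	return strings.Trim(name, "_")
}

func main() {
	fmt.Println(normalizeColumn("husband.phone")) // husband_phone
	fmt.Println(normalizeColumn("Last Name"))     // last_name
	fmt.Println(normalizeColumn("e-mail"))        // e_mail
}
```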
## Standard Column Formatting

- **Email**: lowercase, trimmed; strip any character matching `[^a-z0-9._@-]`
  (i.e., keep only `[a-z0-9._@-]`).
- **Phone**: strip any character matching `[^0-9]` (i.e., keep only digits).
- **Names**:
  - Keep `first_name` / `last_name` if present.
  - Generate `full_name = CONCAT(first_name, ' ', last_name)`.
  - If only `name` exists, rename it to `full_name`.
- **Passwords**:
  - Hashes → `password_hash`.
  - Plaintext → `password`.
  - Never mix hashes and plaintext in the same column.
- **NULLs**: always use SQL `NULL` (never `""` or `"NULL"`).
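The email and phone rules can be sketched directly in Go (the helper names are invented for the example; leak-utils may apply the same regexes in SQL instead):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	emailJunk = regexp.MustCompile(`[^a-z0-9._@-]`) // stripped from emails
	phoneJunk = regexp.MustCompile(`[^0-9]`)        // stripped from phones
)

// cleanEmail lowercases, trims, then removes every character that is
// not in [a-z0-9._@-].
func cleanEmail(s string) string {
	return emailJunk.ReplaceAllString(strings.ToLower(strings.TrimSpace(s)), "")
}

// cleanPhone keeps only the digits.
func cleanPhone(s string) string {
	return phoneJunk.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(cleanEmail("  John.Doe@Example.COM ")) // john.doe@example.com
	fmt.Println(cleanPhone("+33 (0)6-12-34-56-78"))    // 330612345678
}
```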
## Deduplication

Deduplication is often **impractical at scale** (billions of rows). Do **not**
attempt to deduplicate at ingestion time. Instead, handle deduplication **after
running a search** to optimize performance and storage.
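A post-search pass can be as simple as collapsing identical rows in the (small) result set rather than across billions of stored rows. This is a sketch; the `result` type is invented for the example, and Eleakxir's actual result schema may differ:

```go
package main

import "fmt"

type result struct{ email, password string }

// dedupeResults collapses duplicate results in memory, preserving the
// order in which rows were first seen. The same credential pair often
// appears in several leak files, so this runs on the search output,
// not at ingestion time.
func dedupeResults(in []result) []result {
	seen := make(map[result]struct{}, len(in))
	var out []result
	for _, r := range in {
		if _, ok := seen[r]; ok {
			continue
		}
		seen[r] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	rs := []result{
		{"a@x.com", "hunter2"},
		{"a@x.com", "hunter2"}, // same row found in two files
		{"b@x.com", "qwerty"},
	}
	fmt.Println(len(dedupeResults(rs))) // prints 2
}
```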
leak-utils/README.md (new file, 107 lines)
# 🛠 leak-utils: The Eleakxir Data Utility Toolkit

`leak-utils` is a powerful command-line tool built to help you manage, process,
and optimize data leaks for use with the **Eleakxir** search engine. It provides
a suite of utilities for data cleaning, format conversion, and file
manipulation, all designed to ensure your data wells are efficient and
standardized.

`leak-utils` is written in **Go** and leverages **DuckDB** for its
high-performance in-memory processing, ensuring fast and reliable operations on
large datasets.
## 🚀 Features

- **Parquet File Management**: Clean and inspect existing `.parquet` files.
- **Format Conversion**: Seamlessly convert `.csv`, `.txt`, and `.json` files
  into the optimized `.parquet` format.
- **Schema Uniformity**: Tools designed to help you standardize and normalize
  your data to align with the
  [Eleakxir data leak normalization rules](./DATALEAKS-NORMALIZATION.md). This
  ensures a consistent schema across all your files, which is crucial for
  efficient searching and consistent results.
- **High Performance**: Built with Go and DuckDB for fast and efficient data
  processing.
## ⚙️ How to Use

The tool operates via a single executable with different commands, each
corresponding to a specific action. You can find the executable in the
`leak-utils` directory of the Eleakxir project.
### Install

#### With go

```bash
go install "github.com/anotherhadi/eleakxir/leak-utils@latest"
```

#### With Nix/NixOS

<details>
<summary>Click to expand</summary>

**From anywhere (using the repo URL):**

```bash
nix run "github:anotherhadi/eleakxir#leak-utils" -- action [--flags value]
```

**Permanent Installation:**

```nix
# add the flake to your flake.nix
{
  inputs = {
    eleakxir.url = "github:anotherhadi/eleakxir";
  };
}

# then add it to your packages
environment.systemPackages = with pkgs; [ # or home.packages
  eleakxir.packages.${pkgs.system}.leak-utils
];
```

</details>
### Available Actions

#### `cleanParquet`

Optimizes and cleans an existing Parquet file. This can be used to rename or
drop columns, clean rows, and more.

See:

```bash
leak-utils cleanParquet --help
```

#### `infoParquet`

Displays metadata and schema information for a given Parquet file. Useful for
inspecting file structure and column types.

#### `csvToParquet`

Converts a `.csv` file into a highly compressed and efficient `.parquet` file.
This is the recommended way to prepare your data for Eleakxir.

#### `mergeFiles`

Merges multiple files (of the same type) into a single, larger file. This is
useful for combining smaller data leaks.

#### `removeUrlSchemeFromUlp`

This utility prevents the colon (`:`) in URL schemes like `https://` from being
mistakenly parsed as a column separator when processing ULP data in flat files
like CSV or TXT.
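The core transformation mirrors the logic in `leak-utils/misc/ulp.go`: the scheme is stripped only when the line's first colon is the one belonging to `://`, so lines that already start with a host are left untouched:

```go
package main

import (
	"fmt"
	"strings"
)

// stripScheme removes everything up to and including "://" when the
// first ':' on the line is part of a URL scheme, so that ':' can then
// safely act as the ULP column separator.
func stripScheme(line string) string {
	colon := strings.Index(line, ":")
	scheme := strings.Index(line, "://")
	if scheme != -1 && colon == scheme {
		return line[scheme+3:]
	}
	return line
}

func main() {
	fmt.Println(stripScheme("https://example.com:user:pass")) // example.com:user:pass
	fmt.Println(stripScheme("example.com:user:pass"))         // unchanged
}
```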
## 🤝 Contributing

[Contributions](../CONTRIBUTING.md) to `leak-utils` are welcome! Feel free to
open issues or submit pull requests for new features, bug fixes, or performance
improvements.
leak-utils/go.mod (new file, 42 lines)
```go
module github.com/anotherhadi/eleakxir/leak-utils

go 1.25.0

require (
	github.com/charmbracelet/lipgloss/v2 v2.0.0-beta1
	github.com/charmbracelet/log v0.4.2
	github.com/marcboeker/go-duckdb v1.8.5
	github.com/spf13/pflag v1.0.10
)

require (
	github.com/apache/arrow-go/v18 v18.1.0 // indirect
	github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect
	github.com/charmbracelet/colorprofile v0.3.0 // indirect
	github.com/charmbracelet/lipgloss v1.1.0 // indirect
	github.com/charmbracelet/x/ansi v0.8.0 // indirect
	github.com/charmbracelet/x/cellbuf v0.0.13 // indirect
	github.com/charmbracelet/x/term v0.2.1 // indirect
	github.com/go-logfmt/logfmt v0.6.0 // indirect
	github.com/go-viper/mapstructure/v2 v2.2.1 // indirect
	github.com/goccy/go-json v0.10.5 // indirect
	github.com/google/flatbuffers v25.1.24+incompatible // indirect
	github.com/google/uuid v1.6.0 // indirect
	github.com/klauspost/compress v1.17.11 // indirect
	github.com/klauspost/cpuid/v2 v2.2.9 // indirect
	github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
	github.com/mattn/go-isatty v0.0.20 // indirect
	github.com/mattn/go-runewidth v0.0.16 // indirect
	github.com/muesli/cancelreader v0.2.2 // indirect
	github.com/muesli/termenv v0.16.0 // indirect
	github.com/pierrec/lz4/v4 v4.1.22 // indirect
	github.com/rivo/uniseg v0.4.7 // indirect
	github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e // indirect
	github.com/zeebo/xxh3 v1.0.2 // indirect
	golang.org/x/exp v0.0.0-20250128182459-e0ece0dbea4c // indirect
	golang.org/x/mod v0.22.0 // indirect
	golang.org/x/sync v0.10.0 // indirect
	golang.org/x/sys v0.31.0 // indirect
	golang.org/x/tools v0.29.0 // indirect
	golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da // indirect
)
```
leak-utils/go.sum (new file, 94 lines)
```txt
github.com/andybalholm/brotli v1.1.1 h1:PR2pgnyFznKEugtsUo0xLdDop5SKXd5Qf5ysW+7XdTA=
github.com/andybalholm/brotli v1.1.1/go.mod h1:05ib4cKhjx3OQYUY22hTVd34Bc8upXjOLL2rKwwZBoA=
github.com/apache/arrow-go/v18 v18.1.0 h1:agLwJUiVuwXZdwPYVrlITfx7bndULJ/dggbnLFgDp/Y=
github.com/apache/arrow-go/v18 v18.1.0/go.mod h1:tigU/sIgKNXaesf5d7Y95jBBKS5KsxTqYBKXFsvKzo0=
github.com/apache/thrift v0.21.0 h1:tdPmh/ptjE1IJnhbhrcl2++TauVjy242rkV/UzJChnE=
github.com/apache/thrift v0.21.0/go.mod h1:W1H8aR/QRtYNvrPeFXBtobyRkd0/YVhTc6i07XIAgDw=
github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k=
github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8=
github.com/charmbracelet/colorprofile v0.3.0 h1:KtLh9uuu1RCt+Hml4s6Hz+kB1PfV3wi++1h5ia65yKQ=
github.com/charmbracelet/colorprofile v0.3.0/go.mod h1:oHJ340RS2nmG1zRGPmhJKJ/jf4FPNNk0P39/wBPA1G0=
github.com/charmbracelet/lipgloss v1.1.0 h1:vYXsiLHVkK7fp74RkV7b2kq9+zDLoEU4MZoFqR/noCY=
github.com/charmbracelet/lipgloss v1.1.0/go.mod h1:/6Q8FR2o+kj8rz4Dq0zQc3vYf7X+B0binUUBwA0aL30=
github.com/charmbracelet/lipgloss/v2 v2.0.0-beta1 h1:SOylT6+BQzPHEjn15TIzawBPVD0QmhKXbcb3jY0ZIKU=
github.com/charmbracelet/lipgloss/v2 v2.0.0-beta1/go.mod h1:tRlx/Hu0lo/j9viunCN2H+Ze6JrmdjQlXUQvvArgaOc=
github.com/charmbracelet/log v0.4.2 h1:hYt8Qj6a8yLnvR+h7MwsJv/XvmBJXiueUcI3cIxsyig=
github.com/charmbracelet/log v0.4.2/go.mod h1:qifHGX/tc7eluv2R6pWIpyHDDrrb/AG71Pf2ysQu5nw=
github.com/charmbracelet/x/ansi v0.8.0 h1:9GTq3xq9caJW8ZrBTe0LIe2fvfLR/bYXKTx2llXn7xE=
github.com/charmbracelet/x/ansi v0.8.0/go.mod h1:wdYl/ONOLHLIVmQaxbIYEC/cRKOQyjTkowiI4blgS9Q=
github.com/charmbracelet/x/cellbuf v0.0.13 h1:/KBBKHuVRbq1lYx5BzEHBAFBP8VcQzJejZ/IA3iR28k=
github.com/charmbracelet/x/cellbuf v0.0.13/go.mod h1:xe0nKWGd3eJgtqZRaN9RjMtK7xUYchjzPr7q6kcvCCs=
github.com/charmbracelet/x/term v0.2.1 h1:AQeHeLZ1OqSXhrAWpYUtZyX1T3zVxfpZuEQMIQaGIAQ=
github.com/charmbracelet/x/term v0.2.1/go.mod h1:oQ4enTYFV7QN4m0i9mzHrViD7TQKvNEEkHUMCmsxdUg=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/go-logfmt/logfmt v0.6.0 h1:wGYYu3uicYdqXVgoYbvnkrPVXkuLM1p1ifugDMEdRi4=
github.com/go-logfmt/logfmt v0.6.0/go.mod h1:WYhtIu8zTZfxdn5+rREduYbwxfcBr/Vr6KEVveWlfTs=
github.com/go-viper/mapstructure/v2 v2.2.1 h1:ZAaOCxANMuZx5RCeg0mBdEZk7DZasvvZIxtHqx8aGss=
github.com/go-viper/mapstructure/v2 v2.2.1/go.mod h1:oJDH3BJKyqBA2TXFhDsKDGDTlndYOZ6rGS0BRZIxGhM=
github.com/goccy/go-json v0.10.5 h1:Fq85nIqj+gXn/S5ahsiTlK3TmC85qgirsdTP/+DeaC4=
github.com/goccy/go-json v0.10.5/go.mod h1:oq7eo15ShAhp70Anwd5lgX2pLfOS3QCiwU/PULtXL6M=
github.com/golang/snappy v0.0.4 h1:yAGX7huGHXlcLOEtBnF4w7FQwA26wojNCwOYAEhLjQM=
github.com/golang/snappy v0.0.4/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
github.com/google/flatbuffers v25.1.24+incompatible h1:4wPqL3K7GzBd1CwyhSd3usxLKOaJN/AC6puCca6Jm7o=
github.com/google/flatbuffers v25.1.24+incompatible/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/klauspost/asmfmt v1.3.2 h1:4Ri7ox3EwapiOjCki+hw14RyKk201CN4rzyCJRFLpK4=
github.com/klauspost/asmfmt v1.3.2/go.mod h1:AG8TuvYojzulgDAMCnYn50l/5QV3Bs/tp6j0HLHbNSE=
github.com/klauspost/compress v1.17.11 h1:In6xLpyWOi1+C7tXUUWv2ot1QvBjxevKAaI6IXrJmUc=
github.com/klauspost/compress v1.17.11/go.mod h1:pMDklpSncoRMuLFrf1W9Ss9KT+0rH90U12bZKk7uwG0=
github.com/klauspost/cpuid/v2 v2.2.9 h1:66ze0taIn2H33fBvCkXuv9BmCwDfafmiIVpKV9kKGuY=
github.com/klauspost/cpuid/v2 v2.2.9/go.mod h1:rqkxqrZ1EhYM9G+hXH7YdowN5R5RGN6NK4QwQ3WMXF8=
github.com/lucasb-eyer/go-colorful v1.2.0 h1:1nnpGOrhyZZuNyfu1QjKiUICQ74+3FNCN69Aj6K7nkY=
github.com/lucasb-eyer/go-colorful v1.2.0/go.mod h1:R4dSotOR9KMtayYi1e77YzuveK+i7ruzyGqttikkLy0=
github.com/marcboeker/go-duckdb v1.8.5 h1:tkYp+TANippy0DaIOP5OEfBEwbUINqiFqgwMQ44jME0=
github.com/marcboeker/go-duckdb v1.8.5/go.mod h1:6mK7+WQE4P4u5AFLvVBmhFxY5fvhymFptghgJX6B+/8=
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
github.com/mattn/go-runewidth v0.0.16 h1:E5ScNMtiwvlvB5paMFdw9p4kSQzbXFikJ5SQO6TULQc=
github.com/mattn/go-runewidth v0.0.16/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 h1:AMFGa4R4MiIpspGNG7Z948v4n35fFGB3RR3G/ry4FWs=
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8/go.mod h1:mC1jAcsrzbxHt8iiaC+zU4b1ylILSosueou12R++wfY=
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 h1:+n/aFZefKZp7spd8DFdX7uMikMLXX4oubIzJF4kv/wI=
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3/go.mod h1:RagcQ7I8IeTMnF8JTXieKnO4Z6JCsikNEzj0DwauVzE=
github.com/muesli/cancelreader v0.2.2 h1:3I4Kt4BQjOR54NavqnDogx/MIoWBFa0StPA8ELUXHmA=
github.com/muesli/cancelreader v0.2.2/go.mod h1:3XuTXfFS2VjM+HTLZY9Ak0l6eUKfijIfMUZ4EgX0QYo=
github.com/muesli/termenv v0.16.0 h1:S5AlUN9dENB57rsbnkPyfdGuWIlkmzJjbFf0Tf5FWUc=
github.com/muesli/termenv v0.16.0/go.mod h1:ZRfOIKPFDYQoDFF4Olj7/QJbW60Ol/kL1pU3VfY/Cnk=
github.com/pierrec/lz4/v4 v4.1.22 h1:cKFw6uJDK+/gfw5BcDL0JL5aBsAFdsIT18eRtLj7VIU=
github.com/pierrec/lz4/v4 v4.1.22/go.mod h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
github.com/rivo/uniseg v0.4.7 h1:WUdvkW8uEhrYfLC4ZzdpI2ztxP1I582+49Oc5Mq64VQ=
github.com/rivo/uniseg v0.4.7/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=
github.com/spf13/pflag v1.0.10 h1:4EBh2KAYBwaONj6b2Ye1GiHfwjqyROoF4RwYO+vPwFk=
github.com/spf13/pflag v1.0.10/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
github.com/stretchr/testify v1.10.0 h1:Xv5erBjTwe/5IxqUQTdXv5kgmIvbHo3QQyRwhJsOfJA=
github.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e h1:JVG44RsyaB9T2KIHavMF/ppJZNG9ZpyihvCd0w101no=
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e/go.mod h1:RbqR21r5mrJuqunuUZ/Dhy/avygyECGrLceyNeo4LiM=
github.com/zeebo/assert v1.3.0 h1:g7C04CbJuIDKNPFHmsk4hwZDO5O+kntRxzaUoNXj+IQ=
github.com/zeebo/assert v1.3.0/go.mod h1:Pq9JiuJQpG8JLJdtkwrJESF0Foym2/D9XMU5ciN/wJ0=
github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
golang.org/x/exp v0.0.0-20250128182459-e0ece0dbea4c h1:KL/ZBHXgKGVmuZBZ01Lt57yE5ws8ZPSkkihmEyq7FXc=
golang.org/x/exp v0.0.0-20250128182459-e0ece0dbea4c/go.mod h1:tujkw807nyEEAamNbDrEGzRav+ilXA7PCRAd6xsmwiU=
golang.org/x/mod v0.22.0 h1:D4nJWe9zXqHOmWqj4VMOJhvzj7bEZg4wEYa759z1pH4=
golang.org/x/mod v0.22.0/go.mod h1:6SkKJ3Xj0I0BrPOZoBy3bdMptDDU9oJrpohJ3eWZ1fY=
golang.org/x/sync v0.10.0 h1:3NQrjDixjgGwUOCaF8w2+VYHv0Ve/vGYSbdkTa98gmQ=
golang.org/x/sync v0.10.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.31.0 h1:ioabZlmFYtWhL+TRYpcnNlLwhyxaM9kWTDEmfnprqik=
golang.org/x/sys v0.31.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k=
golang.org/x/tools v0.29.0 h1:Xx0h3TtM9rzQpQuR4dKLrdglAmCEN5Oi+P74JdhdzXE=
golang.org/x/tools v0.29.0/go.mod h1:KMQVMRsVxU6nHCFXrBPhDB8XncLNLM0lIy/F14RP588=
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da h1:noIWHXmPHxILtqtCOPIhSt0ABwskkZKjD3bXGnZGpNY=
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da/go.mod h1:NDW/Ps6MPRej6fsCIbMTohpP40sJ/P/vI1MoTEGwX90=
gonum.org/v1/gonum v0.15.1 h1:FNy7N6OUZVUaWG9pTiD+jlhdQ3lMP+/LcTpJ6+a8sQ0=
gonum.org/v1/gonum v0.15.1/go.mod h1:eZTZuRFrzu5pcyjN5wJhcIhnUdNijYxX1T2IcrOGY0o=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
```
leak-utils/leak-utils/main.go (new file, 145 lines)
```go
package main

import (
	"database/sql"
	"fmt"
	"os"
	"slices"
	"strings"

	"github.com/anotherhadi/eleakxir/leak-utils/misc"
	"github.com/anotherhadi/eleakxir/leak-utils/parquet"
	"github.com/anotherhadi/eleakxir/leak-utils/settings"
	"github.com/charmbracelet/log"
	_ "github.com/marcboeker/go-duckdb"
	flag "github.com/spf13/pflag"
)

func main() {
	db, err := sql.Open("duckdb", "")
	if err != nil {
		log.Fatal("Failed to open DuckDB", "error", err)
	}
	defer db.Close()
	lu := settings.LeakUtils{
		Db: db,
	}
	actions := []string{
		"cleanParquet",
		"infoParquet",
		// Csv
		"csvToParquet",
		// Misc
		"mergeFiles",
		"removeUrlSchemeFromUlp",
	}

	if len(os.Args) < 2 {
		fmt.Println(settings.Muted.Render("Usage: "), settings.Accent.Render(os.Args[0], "<action>"))
		fmt.Println(settings.Muted.Render("Actions: "), settings.Base.Render(strings.Join(actions, ", ")))
		return
	}
	action := os.Args[1]
	if !slices.Contains(actions, action) {
		log.Fatal("Unknown action", "action", action)
	}

	switch action {
	case "cleanParquet":
		inputFile := flag.StringP("input", "i", "", "Input Parquet file")
		outputFile := flag.StringP("output", "o", "", "Output Parquet file")
		compression := flag.StringP("compression", "c", "ZSTD", "Compression codec (UNCOMPRESSED, SNAPPY, GZIP, BROTLI, LZ4, ZSTD)")
		skipLineFormating := flag.BoolP("skip-line-formating", "s", false, "Skip line formatting")
		deleteFirstRow := flag.Bool("delete-first-row", false, "Delete first row")
		debug := flag.Bool("debug", false, "Debug mode")
		noColors := flag.Bool("no-colors", false, "Remove all colors")
		printQuery := flag.BoolP("print-query", "p", false, "Print the query instead of executing it")
		flag.Parse()
		if *inputFile == "" || *outputFile == "" {
			log.Fatal("Input and output files are required")
		}
		if *noColors {
			settings.DisableColors()
		}
		lu.Compression = *compression
		lu.Debug = *debug
		err := parquet.CleanParquet(lu, *inputFile, *outputFile, *skipLineFormating, *deleteFirstRow, *printQuery)
		if err != nil {
			log.Fatal("Failed to clean Parquet file", "error", err)
		}
		return
	case "infoParquet":
		inputFile := flag.StringP("input", "i", "", "Input Parquet file")
		debug := flag.Bool("debug", false, "Debug mode")
		noColors := flag.Bool("no-colors", false, "Remove all colors")
		flag.Parse()
		if *inputFile == "" {
			log.Fatal("Input file is required")
		}
		if *noColors {
			settings.DisableColors()
		}
		lu.Debug = *debug
		err := parquet.InfoParquet(lu, *inputFile)
		if err != nil {
			log.Fatal("Failed to read Parquet file", "error", err)
		}
		return
	case "csvToParquet":
		inputFile := flag.StringP("input", "i", "", "Input CSV file")
		outputFile := flag.StringP("output", "o", "", "Output Parquet file")
		strict := flag.Bool("strict", true, "Strict mode for DuckDB")
		compression := flag.StringP("compression", "c", "ZSTD", "Compression codec (UNCOMPRESSED, SNAPPY, GZIP, BROTLI, LZ4, ZSTD)")
		noColors := flag.Bool("no-colors", false, "Remove all colors")
		debug := flag.Bool("debug", false, "Debug mode")
		flag.Parse()
		if *inputFile == "" || *outputFile == "" {
			log.Fatal("Input and output files are required")
		}
		if *noColors {
			settings.DisableColors()
		}
		lu.Compression = *compression
		lu.Debug = *debug
		err := misc.CsvToParquet(lu, *inputFile, *outputFile, *strict)
		if err != nil {
			log.Fatal("Failed to transform CSV file", "error", err)
		}
		return
	case "mergeFiles":
		inputFiles := flag.StringArrayP("inputs", "i", []string{}, "Input files")
		outputFile := flag.StringP("output", "o", "", "Output file")
		noColors := flag.Bool("no-colors", false, "Remove all colors")
		debug := flag.Bool("debug", false, "Debug mode")
		flag.Parse()
		if len(*inputFiles) == 0 || *outputFile == "" {
			log.Fatal("Input and output files are required")
		}
		if *noColors {
			settings.DisableColors()
		}
		lu.Debug = *debug
		err := misc.MergeFiles(lu, *outputFile, *inputFiles...)
		if err != nil {
			log.Fatal("Failed to merge files", "error", err)
		}
		return
	case "removeUrlSchemeFromUlp":
		inputFile := flag.StringP("input", "i", "", "Input ULP file")
		noColors := flag.Bool("no-colors", false, "Remove all colors")
		debug := flag.Bool("debug", false, "Debug mode")
		flag.Parse()
		if *inputFile == "" {
			log.Fatal("Input file is required")
		}
		if *noColors {
			settings.DisableColors()
		}
		lu.Debug = *debug
		err := misc.RemoveUrlSchemeFromUlp(lu, *inputFile)
		if err != nil {
			log.Fatal("Failed to remove ULP URL schemes", "error", err)
		}
		return
	}
}
```
leak-utils/misc/csv.go (new file, 173 lines)
```go
package misc

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"slices"
	"strings"

	"github.com/anotherhadi/eleakxir/leak-utils/settings"
	"github.com/charmbracelet/log"
)

func CsvToParquet(lu settings.LeakUtils, inputFile string, outputFile string, strict bool) error {
	hasHeader, err := csvHasHeader(inputFile)
	if err != nil {
		return err
	}
	header := "true"
	if !hasHeader {
		header = "false"
	}
	strictMode := "true"
	if !strict {
		strictMode = "false"
	}

	delimiter := getDelimiter(inputFile)

	query := fmt.Sprintf(`CREATE TABLE my_table AS FROM read_csv_auto('%s', HEADER=%s, delim='%s', ignore_errors=true, all_varchar=true, null_padding=true, strict_mode=%s);
COPY my_table TO '%s' (FORMAT 'parquet', COMPRESSION '%s', ROW_GROUP_SIZE 200_000);`,
		inputFile, header, delimiter, strictMode, outputFile, lu.Compression)

	if lu.Debug {
		log.Info("Detected delimiter", "delimiter", delimiter)
		log.Info("CSV header detection", "hasHeader", hasHeader)
		log.Info("Executing query", "query", query)
	}

	_, err = lu.Db.Exec(query)

	if lu.Debug {
		log.Info("Finished executing query")
	}

	return err
}

// getDelimiter guesses the delimiter by counting candidate characters
// over the first ten lines and picking the most frequent one.
func getDelimiter(inputFile string) string {
	lines, err := getNLine(inputFile, 10, 0)
	if err != nil {
		log.Warn("Failed to read CSV file to determine delimiter, defaulting to comma", "error", err)
		return ","
	}

	delimiterCounts := map[string]int{
		",":  0,
		";":  0,
		"\t": 0,
		"|":  0,
		":":  0,
	}

	for _, line := range lines {
		for d := range delimiterCounts {
			delimiterCounts[d] += strings.Count(line, d)
		}
	}

	maxCount := 0
	delimiter := ","

	for d, count := range delimiterCounts {
		if count > maxCount {
			maxCount = count
			delimiter = d
		}
	}

	return delimiter
}

// csvHasHeader reports whether the first row looks like a header by
// normalizing each cell and checking it against known column names.
func csvHasHeader(inputFile string) (hasHeader bool, err error) {
	firstRow, err := getFirstRowCsv(inputFile)
	if err != nil {
		return false, err
	}
	for i, col := range firstRow {
		col = strings.ReplaceAll(col, "\"", "")
		col = strings.ReplaceAll(col, " ", "")
		col = strings.ReplaceAll(col, "-", "")
		col = strings.ReplaceAll(col, "_", "")
		col = strings.ReplaceAll(col, ".", "")
		firstRow[i] = strings.ToLower(strings.TrimSpace(col))
	}
	knownHeaders := []string{"email", "password", "username", "phone", "lastname", "firstname"}
	for _, knownHeader := range knownHeaders {
		if slices.Contains(firstRow, knownHeader) {
			return true, nil
		}
	}
	return false, nil
}

// getNLine returns up to n lines from the file, skipping the first
// offset lines.
func getNLine(inputFile string, n, offset int) (lines []string, err error) {
	if n <= 0 {
		return nil, nil
	}

	if offset < 0 {
		offset = 0
	}

	file, err := os.Open(inputFile)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	currentLine := 0

	for scanner.Scan() {
		currentLine++
		if currentLine <= offset {
			continue
		}

		lines = append(lines, scanner.Text())
		if len(lines) >= n {
			break
		}
	}

	if err := scanner.Err(); err != nil && err != io.EOF {
		return nil, err
	}

	return lines, nil
}

func getFirstRowCsv(inputFile string) (row []string, err error) {
	rows, err := getFirstNRowsCsv(inputFile, 1)
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("no rows found in CSV")
	}
	return rows[0], nil
}

func getFirstNRowsCsv(inputFile string, n int) (rows [][]string, err error) {
	f, err := os.Open(inputFile)
	if err != nil {
		return nil, fmt.Errorf("failed to open file: %w", err)
	}
	defer f.Close()

	reader := csv.NewReader(f)

	for i := 0; i < n; i++ {
		row, err := reader.Read()
		if err != nil {
			if err == io.EOF {
				break
			}
			return nil, fmt.Errorf("failed to read CSV: %w", err)
		}
		rows = append(rows, row)
	}

	return rows, nil
}
```
leak-utils/misc/misc.go (new file, 31 lines)
```go
package misc

import (
	"io"
	"os"

	"github.com/anotherhadi/eleakxir/leak-utils/settings"
)

// MergeFiles concatenates the input files, in order, into outputFile.
func MergeFiles(lu settings.LeakUtils, outputFile string, inputFiles ...string) error {
	out, err := os.Create(outputFile)
	if err != nil {
		return err
	}
	defer out.Close()

	for _, inputFile := range inputFiles {
		file, err := os.Open(inputFile)
		if err != nil {
			return err
		}

		_, err = io.Copy(out, file)
		// Close inside the loop: a defer here would keep every input
		// file open until MergeFiles returns.
		file.Close()
		if err != nil {
			return err
		}
	}

	return nil
}
```
leak-utils/misc/ulp.go (new file, 67 lines)
```go
package misc

import (
	"bufio"
	"io"
	"os"
	"strings"

	"github.com/anotherhadi/eleakxir/leak-utils/settings"
)

// RemoveUrlSchemeFromUlp rewrites the file in place, stripping the
// "scheme://" prefix from lines that start with a URL so that ':' can
// safely be used as the ULP column separator.
func RemoveUrlSchemeFromUlp(lu settings.LeakUtils, inputFile string) error {
	file, err := os.Open(inputFile)
	if err != nil {
		return err
	}
	defer file.Close()

	outputFile := inputFile + ".clean"
	out, err := os.Create(outputFile)
	if err != nil {
		return err
	}
	defer out.Close()

	reader := bufio.NewReader(file)
	writer := bufio.NewWriter(out)

	for {
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return err
		}

		// Only strip when the first ':' on the line is the one that
		// belongs to the scheme ("://").
		firstColon := strings.Index(line, ":")
		firstScheme := strings.Index(line, "://")
		if firstScheme != -1 && firstColon == firstScheme {
			line = line[firstScheme+3:]
		}

		_, werr := writer.WriteString(line)
		if werr != nil {
			return werr
		}

		if err == io.EOF {
			break
		}
	}

	err = writer.Flush()
	if err != nil {
		return err
	}

	err = os.Remove(inputFile)
	if err != nil {
		return err
	}

	err = os.Rename(outputFile, inputFile)
	if err != nil {
		return err
	}

	return nil
}
```
107 leak-utils/parquet/format.go Normal file
@@ -0,0 +1,107 @@
package parquet

import (
	"fmt"
	"strings"

	"github.com/anotherhadi/eleakxir/leak-utils/settings"
)

// addFullname ensures the output has a full_name column:
// if first_name and last_name exist, full_name is their concatenation;
// otherwise a name, first_name, or last_name column is promoted to full_name.
func addFullname(operations []ColumnOperation) []ColumnOperation {
	hasFullName := false
	hasFirstName := false
	hasLastName := false
	hasName := false
	for _, op := range operations {
		if op.Action != "drop" {
			switch op.NewName {
			case "full_name":
				hasFullName = true
			case "first_name":
				hasFirstName = true
			case "last_name":
				hasLastName = true
			case "name":
				hasName = true
			}
		}
	}
	if hasFullName {
		return operations
	}
	if hasFirstName && hasLastName {
		operations = append(operations, ColumnOperation{
			OriginalName: "first_name || ' ' || last_name",
			NewName:      "full_name",
			Action:       "rename",
		})
		fmt.Println(settings.Muted.Render("\nAdding new column 'full_name' as concatenation of 'first_name' and 'last_name'."))
		return operations
	}
	if hasName {
		for i, op := range operations {
			if op.NewName == "name" && op.Action != "drop" {
				operations[i].NewName = "full_name"
				fmt.Println(settings.Muted.Render("\nRenaming column 'name' to 'full_name'."))
				return operations
			}
		}
	}
	if hasFirstName {
		operations = append(operations, ColumnOperation{
			OriginalName: "first_name",
			NewName:      "full_name",
			Action:       "rename",
		})
		fmt.Println(settings.Muted.Render("\nAdding new column 'full_name' from 'first_name'."))
		return operations
	}
	if hasLastName {
		operations = append(operations, ColumnOperation{
			OriginalName: "last_name",
			NewName:      "full_name",
			Action:       "rename",
		})
		fmt.Println(settings.Muted.Render("\nAdding new column 'full_name' from 'last_name'."))
		return operations
	}

	return operations
}

// formatColumnName formats a column name to be SQL-compliant.
func formatColumnName(columnName string) string {
	columnName = strings.TrimSpace(columnName)
	columnName = strings.ToLower(columnName)
	columnName = strings.Join(strings.Fields(columnName), "_")
	columnName = strings.ReplaceAll(columnName, "\"", "")
	columnName = strings.ReplaceAll(columnName, "'", "")
	columnName = strings.ReplaceAll(columnName, "-", "_")
	// Only keep a-z, 0-9 and _
	var formatted strings.Builder
	for _, r := range columnName {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '_' {
			formatted.WriteRune(r)
		}
	}
	columnName = formatted.String()
	columnName = strings.TrimPrefix(columnName, "_")
	columnName = strings.TrimSuffix(columnName, "_")
	return columnName
}

// formatColumns wraps phone and email columns in normalization expressions.
func formatColumns(operations []ColumnOperation) []ColumnOperation {
	formattedOperations := []ColumnOperation{}
	for _, op := range operations {
		if op.NewName == "phone" || strings.HasSuffix(op.NewName, "_phone") {
			op.OriginalName = "REGEXP_REPLACE(" + op.OriginalName + ", '[^0-9]', '')"
		} else if op.NewName == "email" || strings.HasSuffix(op.NewName, "_email") {
			op.OriginalName = "REGEXP_REPLACE(LOWER(TRIM(" + op.OriginalName + ")), '[^a-z0-9._@-]', '')"
		}
		formattedOperations = append(formattedOperations, op)
	}
	return formattedOperations
}
276 leak-utils/parquet/parquet.go Normal file
@@ -0,0 +1,276 @@
package parquet

import (
	"bufio"
	"database/sql"
	"fmt"
	"os"
	"strings"

	"github.com/anotherhadi/eleakxir/leak-utils/settings"
	"github.com/charmbracelet/log"
)

type Parquet struct {
	Filepath    string
	Filename    string
	Columns     []string
	Sample      [][]string
	NRows       int64
	Compression string // Compression of the output file (e.g., "SNAPPY", "ZSTD", "NONE" or "")
}

type ColumnOperation struct {
	OriginalName string
	NewName      string
	Action       string // "keep", "rename", "drop"
}

func (parquet Parquet) PrintParquet() {
	fmt.Println(settings.Header.Render(parquet.Filename) + "\n")
	fmt.Println(settings.Accent.Render("File path:"), settings.Base.Render(parquet.Filepath))
	fmt.Println(settings.Accent.Render("Number of columns:"), settings.Base.Render(fmt.Sprintf("%d", len(parquet.Columns))))
	fmt.Println(settings.Accent.Render("Number of rows:"), settings.Base.Render(formatWithSpaces(parquet.NRows)))
	fmt.Println()
	fmt.Println(settings.Accent.Render(strings.Join(parquet.Columns, " | ")))
	for _, row := range parquet.Sample {
		fmt.Println(settings.Base.Render(strings.Join(row, " | ")))
	}
}

func InfoParquet(lu settings.LeakUtils, inputFile string) error {
	parquet, err := GetParquet(lu.Db, inputFile)
	if err != nil {
		return err
	}

	parquet.PrintParquet()
	return nil
}

func CleanParquet(lu settings.LeakUtils, inputFile, outputFile string, skipLineFormatting, deleteFirstRow, printQuery bool) error {
	input, err := GetParquet(lu.Db, inputFile)
	if err != nil {
		return err
	}
	input.PrintParquet()
	columnOps := configureColumns(*input, skipLineFormatting)
	output := Parquet{
		Filepath:    outputFile,
		Compression: lu.Compression,
	}
	err = transformParquet(lu, *input, output, columnOps, deleteFirstRow, printQuery)
	return err
}

func configureColumns(input Parquet, skipLineFormatting bool) []ColumnOperation {
	reader := bufio.NewReader(os.Stdin)
	var operations []ColumnOperation

	fmt.Println()
	fmt.Println(settings.Base.Render("For each column, choose an action:"))
	fmt.Println(settings.Base.Render("  [k] Keep"))
	fmt.Println(settings.Base.Render("  [r] Rename"))
	fmt.Println(settings.Base.Render("  [d] Drop/Delete"))
	fmt.Println(settings.Base.Render("  [s] Suggested"))
	fmt.Println(settings.Base.Render("  [b] Go back"))
	fmt.Println()

	for i := 0; i < len(input.Columns); i++ {
		col := input.Columns[i]
		suggestion := getSuggestion(col)

		for {
			fmt.Println(settings.Muted.Render("\nColumn:"), settings.Accent.Render(col))
			if suggestion != "" {
				fmt.Println(settings.Alert.Render("Suggested action: Rename to '" + suggestion + "'"))
			}
			fmt.Print(settings.Base.Render("[k/r/d/s/b]: "))

			choice, err := reader.ReadString('\n')
			if err != nil {
				log.Printf("Error reading input: %v", err)
				continue
			}
			choice = strings.TrimSpace(strings.ToLower(choice))

			op := ColumnOperation{
				OriginalName: col,
				NewName:      col,
				Action:       "keep",
			}

			switch choice {
			case "b", "back":
				if i > 0 {
					i -= 2
					if len(operations) > 0 {
						operations = operations[:len(operations)-1]
					}
					fmt.Println(settings.Muted.Render("Going back to the previous column..."))
				} else {
					fmt.Println(settings.Muted.Render("Already at the first column, cannot go back further."))
					continue
				}
				goto nextColumn

			case "r", "rename":
				fmt.Print(settings.Base.Render("Enter new name: "))
				newName, err := reader.ReadString('\n')
				if err != nil {
					log.Printf("Error reading new name: %v", err)
					continue
				}
				newName = strings.TrimSpace(newName)
				if newName == "" {
					fmt.Println(settings.Muted.Render("Invalid name, please try again."))
					continue
				}
				op.OriginalName = "\"" + op.OriginalName + "\""
				op.NewName = formatColumnName(newName)
				op.Action = "rename"
				operations = append(operations, op)
				goto nextColumn

			case "s", "suggested":
				if suggestion == "" {
					fmt.Println(settings.Muted.Render("No valid suggestion available"))
					continue
				}
				op.OriginalName = "\"" + op.OriginalName + "\""
				op.NewName = formatColumnName(suggestion)
				op.Action = "rename"
				operations = append(operations, op)
				goto nextColumn

			case "d", "drop", "delete":
				op.Action = "drop"
				operations = append(operations, op)
				goto nextColumn

			case "k", "keep", "":
				op.OriginalName = "\"" + op.OriginalName + "\""
				op.NewName = formatColumnName(op.NewName)
				op.Action = "rename"
				operations = append(operations, op)
				goto nextColumn

			default:
				fmt.Println(settings.Muted.Render("Invalid choice, please enter [k/r/d/s/b]."))
				continue
			}
		}
	nextColumn:
		// Going back can leave the slice empty; guard before printing a summary.
		if len(operations) == 0 {
			continue
		}
		lastOp := operations[len(operations)-1]
		switch lastOp.Action {
		case "rename":
			if formatColumnName(lastOp.OriginalName) == lastOp.NewName {
				fmt.Printf(settings.Muted.Render("Keeping column '%s' as is.\n"), lastOp.OriginalName)
			} else {
				fmt.Printf(settings.Muted.Render("Renaming column '%s' to '%s'.\n"), lastOp.OriginalName, lastOp.NewName)
			}
		case "drop":
			fmt.Printf(settings.Muted.Render("Dropping column '%s'.\n"), lastOp.OriginalName)
		}
	}
	if !skipLineFormatting {
		operations = formatColumns(operations)
	}
	operations = addFullname(operations)

	return operations
}

func transformParquet(lu settings.LeakUtils, input, output Parquet, operations []ColumnOperation, deleteFirstRow, printQuery bool) error {
	var selectClauses []string
	hasColumns := false

	for _, op := range operations {
		if op.Action != "drop" {
			hasColumns = true
			if op.Action == "rename" {
				selectClauses = append(selectClauses, fmt.Sprintf("%s AS \"%s\"", op.OriginalName, op.NewName))
			} else {
				selectClauses = append(selectClauses, op.OriginalName)
			}
		}
	}

	if !hasColumns {
		return fmt.Errorf("no columns selected for output")
	}

	selectClause := strings.Join(selectClauses, ", ")
	compression := ""
	if output.Compression != "" {
		compression = ", COMPRESSION '" + output.Compression + "'"
	}

	// Rows whose combined cell length exceeds ~30 characters per column
	// are treated as oversized and filtered out.
	columnsLength := []string{}
	for _, col := range input.Columns {
		columnsLength = append(columnsLength, "COALESCE(LENGTH(\""+col+"\"),0)")
	}
	allowedRowSize := 30 * len(input.Columns)
	offset := ""
	if deleteFirstRow {
		offset = "OFFSET 1"
	}

	query := fmt.Sprintf(`
COPY (
    SELECT %s
    FROM read_parquet('%s')
    WHERE (%s) < %d
    %s
) TO '%s' (FORMAT PARQUET, ROW_GROUP_SIZE 200_000%s)
`, selectClause, input.Filepath, strings.Join(columnsLength, "+"), allowedRowSize, offset, output.Filepath, compression)

	if printQuery {
		fmt.Println("Query:", query)
		return nil
	}

	fmt.Println(settings.Base.Render("\nTransforming and writing to output parquet..."))
	_, err := lu.Db.Exec(query)
	if err != nil {
		return fmt.Errorf("failed to execute transformation: %w", err)
	}
	fmt.Println(settings.Base.Render("Transformation complete!\n"))

	newParquet, err := GetParquet(lu.Db, output.Filepath)
	if err != nil {
		return err
	}
	newParquet.PrintParquet()

	return nil
}

func GetParquet(db *sql.DB, inputFile string) (parquet *Parquet, err error) {
	parquet = &Parquet{}
	parquet.Filepath = inputFile

	parquet.Columns, err = getColumns(db, inputFile)
	if err != nil {
		return
	}
	parquet.NRows, err = countRows(db, inputFile)
	if err != nil {
		return
	}
	parquet.Sample, err = getFirstNRows(db, inputFile, 6)
	if err != nil {
		return
	}

	n := strings.LastIndex(inputFile, "/")
	if n == -1 {
		parquet.Filename = inputFile
	} else {
		parquet.Filename = inputFile[n+1:]
	}

	return
}
81 leak-utils/parquet/suggestions.go Normal file
@@ -0,0 +1,81 @@
package parquet

import (
	"slices"
)

func getSuggestion(col string) string {
	col = formatColumnName(col)
	knownNames := []string{
		"date",
		"phone",
		"username",
		"address",
		"email",
		"postal_code",
		"city",
		"country",
		"state",
		"age",
		"gender",
		"password",
		"password_hash",
		"full_name",
		"last_name",
		"name", // Will be renamed to full_name later
		"first_name",
		"birth_date",
		"url",
		"ip",
	}
	if slices.Contains(knownNames, col) {
		return col
	}

	// Common aliases mapped to their canonical column names.
	aliases := map[string]string{
		"user":            "username",
		"login":           "username",
		"sex":             "gender",
		"ip_address":      "ip",
		"mail":            "email",
		"firstname":       "first_name",
		"lastname":        "last_name",
		"fullname":        "full_name",
		"street_address":  "address",
		"zip":             "postal_code",
		"zipcode":         "postal_code",
		"zip_code":        "postal_code",
		"postalcode":      "postal_code",
		"postal":          "postal_code",
		"hash":            "password_hash",
		"hashed_password": "password_hash",
		"hash_password":   "password_hash",
		"password_hashed": "password_hash",
		"birthdate":       "birth_date",
		"dob":             "birth_date",
		"date_of_birth":   "birth_date",
	}
	if alias, ok := aliases[col]; ok {
		return alias
	}

	return ""
}

// HINTS:
// date: _date
// url: _url, link
// address: _address
105 leak-utils/parquet/utils.go Normal file
@@ -0,0 +1,105 @@
package parquet

import (
	"database/sql"
	"fmt"
	"strconv"
	"strings"
)

// getColumns retrieves the column names from the Parquet file.
func getColumns(db *sql.DB, filepath string) ([]string, error) {
	// Create a view from the parquet file
	query := fmt.Sprintf("CREATE OR REPLACE VIEW parquet_view AS SELECT * FROM read_parquet('%s')", filepath)
	_, err := db.Exec(query)
	if err != nil {
		return nil, fmt.Errorf("failed to create view: %w", err)
	}

	// Get column information
	rows, err := db.Query("DESCRIBE parquet_view")
	if err != nil {
		return nil, fmt.Errorf("failed to describe view: %w", err)
	}
	defer rows.Close()

	var columns []string
	for rows.Next() {
		var colName, colType, nullable, key, defaultVal, extra sql.NullString
		err := rows.Scan(&colName, &colType, &nullable, &key, &defaultVal, &extra)
		if err != nil {
			return nil, fmt.Errorf("failed to scan row: %w", err)
		}
		if colName.Valid {
			columns = append(columns, colName.String)
		}
	}

	return columns, nil
}

// getFirstNRows retrieves the first N rows from the Parquet file.
func getFirstNRows(db *sql.DB, inputFile string, n int) ([][]string, error) {
	query := fmt.Sprintf("SELECT * FROM read_parquet('%s') LIMIT %d", inputFile, n)
	rows, err := db.Query(query)
	if err != nil {
		return nil, fmt.Errorf("failed to query parquet file: %w", err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return nil, fmt.Errorf("failed to get columns: %w", err)
	}

	var results [][]string
	for rows.Next() {
		values := make([]sql.NullString, len(cols))
		valuePtrs := make([]any, len(cols))
		for i := range values {
			valuePtrs[i] = &values[i]
		}

		err := rows.Scan(valuePtrs...)
		if err != nil {
			return nil, fmt.Errorf("failed to scan row: %w", err)
		}

		var row []string
		for _, val := range values {
			if val.Valid {
				row = append(row, val.String)
			} else {
				row = append(row, "NULL")
			}
		}
		results = append(results, row)
	}

	return results, nil
}

// countRows counts the number of rows in the Parquet file.
func countRows(db *sql.DB, inputFile string) (int64, error) {
	var count int64
	err := db.QueryRow(fmt.Sprintf("SELECT COUNT(*) FROM read_parquet('%s')", inputFile)).Scan(&count)
	if err != nil {
		return 0, fmt.Errorf("failed to count rows: %w", err)
	}
	return count, nil
}

// formatWithSpaces formats an integer with spaces as thousand separators.
func formatWithSpaces(n int64) string {
	s := strconv.FormatInt(n, 10)

	var b strings.Builder
	l := len(s)
	for i, c := range s {
		if i != 0 && (l-i)%3 == 0 {
			b.WriteRune(' ')
		}
		b.WriteRune(c)
	}
	return b.String()
}
27 leak-utils/settings/colors.go Normal file
@@ -0,0 +1,27 @@
package settings

import (
	"github.com/charmbracelet/lipgloss/v2"
)

var (
	purple      = lipgloss.Color("99")
	lightPurple = lipgloss.Color("98")
	yellow      = lipgloss.Color("220")
	gray        = lipgloss.Color("245")
	lightGray   = lipgloss.Color("241")

	Header = lipgloss.NewStyle().Foreground(purple).Bold(true)
	Accent = lipgloss.NewStyle().Foreground(lightPurple)
	Base   = lipgloss.NewStyle().Foreground(lightGray)
	Alert  = lipgloss.NewStyle().Foreground(yellow).Bold(true)
	Muted  = lipgloss.NewStyle().Foreground(gray)
)

func DisableColors() {
	Header = lipgloss.NewStyle()
	Accent = lipgloss.NewStyle()
	Base = lipgloss.NewStyle()
	Alert = lipgloss.NewStyle()
	Muted = lipgloss.NewStyle()
}
9 leak-utils/settings/settings.go Normal file
@@ -0,0 +1,9 @@
package settings

import "database/sql"

type LeakUtils struct {
	Debug       bool
	Compression string // Compression of the output file (e.g., "SNAPPY", "ZSTD", "NONE" or "")
	Db          *sql.DB
}