commit b9fbed9a54
Hadi, 2025-09-24 17:20:03 +02:00
83 changed files with 6241 additions and 0 deletions

# Rules for handling Data Leaks
This normalization framework is designed to standardize data leaks for
[Eleakxir](https://github.com/anotherhadi/eleakxir), the open-source search
engine, using
[leak-utils](https://github.com/anotherhadi/eleakxir-temp/blob/main/leak-utils/README.md),
a dedicated CLI tool that converts and cleans files for efficient indexing and
searching.
## The Relevance of Parquet for Data Leaks
Parquet is an efficient, open-source columnar storage file format designed to
handle complex data in bulk. When dealing with data leaks, it is a particularly
good fit for several reasons:
- **Compression**: Parquet files offer superior compression compared to
row-based formats like CSV. By storing data column by column, it applies more
effective compression algorithms, which significantly reduces disk space. For
data leaks, where file sizes can range from gigabytes to terabytes, this is
crucial for minimizing storage costs.
- **Query Performance**: As a columnar format, Parquet allows you to read only
the specific columns you need for a query. In a data leak, you might only be
interested in emails and passwords, not full addresses or phone numbers. This
selective reading drastically speeds up search operations, as the system
doesn't have to scan through entire rows of irrelevant data.
- **Efficiency**: The format is optimized for analytics. It stores data with
metadata and statistics (min/max values) for each column, allowing for query
**pruning**. This means a query can skip entire blocks of data that don't
match the filtering criteria, boosting performance even further.
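The min/max pruning idea can be illustrated with a toy sketch. The `rowGroup` type and `scanWithPruning` function below are illustrative inventions, not how a real Parquet reader is implemented; they only mimic the statistics-based skipping described above:

```go
package main

import "fmt"

// rowGroup mimics the per-column min/max statistics a Parquet file stores
// for each block of rows.
type rowGroup struct {
	min, max string   // column statistics for this group
	rows     []string // the column values themselves
}

// scanWithPruning skips every group whose [min, max] range cannot contain
// the target value, the same idea a Parquet reader uses to prune row groups.
func scanWithPruning(groups []rowGroup, target string) (hits []string, scanned int) {
	for _, g := range groups {
		if target < g.min || target > g.max {
			continue // pruned: statistics rule out any match
		}
		scanned++
		for _, v := range g.rows {
			if v == target {
				hits = append(hits, v)
			}
		}
	}
	return hits, scanned
}

func main() {
	groups := []rowGroup{
		{min: "alice", max: "carol", rows: []string{"alice", "bob", "carol"}},
		{min: "dave", max: "frank", rows: []string{"dave", "erin", "frank"}},
	}
	hits, scanned := scanWithPruning(groups, "erin")
	fmt.Println(hits, scanned) // [erin] 1
}
```

Only one of the two groups is actually scanned: the statistics of the first rule it out before any row is read.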
## Disclaimer
The information in this document is provided **for research and educational
purposes only**. I am **not responsible** for how this data, methods, or
guidelines are used. Any misuse, unlawful activity, or harm resulting from
applying this content is the sole responsibility of the individual or
organization using it.
## File Naming Convention
- **Lowercase only**, ASCII (no accents).
- **Separators**:
- `_` inside blocks (`date_2023_10`)
- `-` between blocks (`instagram.com-date_2023_10`)
- **Prefix**: always start with the **source name/url** (e.g., `instagram.com`,
`alien_txt`).
- **Blocks**: each additional part must be prefixed by its block name:
- `date_YYYY[_MM[_DD]]` → use ISO format (year, or year-month, or full date).
- `source_*` → origin of the leak (e.g., `scrape`, `dump`, `combo`).
- `version_v*` → versioning if regenerated or transformed.
- `notes_*` → optional clarifications.
- **Extension**: always `.parquet`.
**Recommended pattern:**
```txt
{source}-date_{YYYY[_MM[_DD]]}-source_{origin}-version_{vN}-notes_{info}.parquet
```
**Examples:**
```txt
instagram.com-date_2023_10.parquet
alien_txt-date_2022-source_dump.parquet
combo_french-notes_crypto.parquet
```
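The convention above can be sketched as a small helper. `buildLeakFilename` is a hypothetical function, not part of leak-utils, assuming empty blocks are simply omitted:

```go
package main

import (
	"fmt"
	"strings"
)

// buildLeakFilename assembles a file name following the convention above:
// lowercase, `_` inside blocks, `-` between blocks, `.parquet` extension.
// Empty blocks are omitted.
func buildLeakFilename(source, date, origin, version, notes string) string {
	parts := []string{strings.ToLower(source)}
	if date != "" {
		parts = append(parts, "date_"+date)
	}
	if origin != "" {
		parts = append(parts, "source_"+origin)
	}
	if version != "" {
		parts = append(parts, "version_"+version)
	}
	if notes != "" {
		parts = append(parts, "notes_"+notes)
	}
	return strings.Join(parts, "-") + ".parquet"
}

func main() {
	fmt.Println(buildLeakFilename("instagram.com", "2023_10", "", "", ""))
	// instagram.com-date_2023_10.parquet
}
```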
## Column Naming Convention
- **snake\_case only** (lowercase, `_` separator).
- **No dots (`.`)** in column names (`husband.phone` → `husband_phone`).
- **Allowed characters**: `[a-z0-9_]+` (no spaces, hyphens, or accents).
- **Multiple variants of the same field**:
- Relations → prefix clearly: `husband_phone`, `mother_last_name`.
- Multiples of the same type → numbered prefix: `1_phone`, `2_phone`,
`3_phone`.
- Always end with the column "type" (e.g., `_phone`, `_last_name`).
- **Rename if mislabeled**: If a `username` column actually contains only
  emails, rename it to `email`.
- **Remove irrelevant columns**: Drop meaningless identifiers like `id` or
fields with no analytical value.
- **Standard columns**: to enable schema alignment across leaks:
| Column |
| ------------- |
| email |
| username |
| password |
| password_hash |
| phone |
| date |
| birth_date |
| age |
| first_name |
| last_name |
| full_name |
| address |
| city |
| country |
| state |
| postal_code |
| ip |
| url |
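Aligning raw headers with these standard columns can be done with a small alias table. The aliases below are illustrative assumptions, not part of the framework:

```go
package main

import "fmt"

// columnAliases maps raw headers commonly seen in leaks to the standard
// columns above. The alias list is an illustrative sample, not exhaustive.
var columnAliases = map[string]string{
	"mail":       "email",
	"login":      "username",
	"pass":       "password",
	"hash":       "password_hash",
	"telephone":  "phone",
	"dob":        "birth_date",
	"zip":        "postal_code",
	"ip_address": "ip",
}

// standardColumn returns the canonical name for a raw header, or the header
// unchanged when no alias is known.
func standardColumn(raw string) string {
	if std, ok := columnAliases[raw]; ok {
		return std
	}
	return raw
}

func main() {
	fmt.Println(standardColumn("mail"), standardColumn("city")) // email city
}
```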
## Standard Column Formatting
- **Email**: lowercase, trimmed, then strip any character matching
  `[^a-z0-9._@-]`.
- **Phone**: strip any character matching `[^0-9]` (digits only).
- **Names**:
- Keep `first_name` / `last_name` if present.
- Generate `full_name = CONCAT(first_name, ' ', last_name)`.
- If only `name` exists, rename it to `full_name`.
- **Passwords**:
- Hashes → `password_hash`.
- Plaintext → `password`.
- Never mix hashes and plaintext in the same column.
- **NULLs**: always use SQL `NULL` (never `""` or `"NULL"`).
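The email and phone rules above can be sketched in Go with `regexp`. `normalizeEmail` and `normalizePhone` are hypothetical helpers mirroring the character classes given above:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	emailJunk = regexp.MustCompile(`[^a-z0-9._@-]`) // anything outside the allowed set
	phoneJunk = regexp.MustCompile(`[^0-9]`)        // anything that is not a digit
)

// normalizeEmail lowercases, trims, then strips every character matching
// [^a-z0-9._@-], as the rule above requires.
func normalizeEmail(s string) string {
	return emailJunk.ReplaceAllString(strings.ToLower(strings.TrimSpace(s)), "")
}

// normalizePhone keeps digits only.
func normalizePhone(s string) string {
	return phoneJunk.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(normalizeEmail("  John.Doe@Example.COM ")) // john.doe@example.com
	fmt.Println(normalizePhone("+33 6 12 34 56 78"))       // 33612345678
}
```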
## Deduplication
Deduplication is often **impractical at scale** (billions of rows). Do **not**
attempt to deduplicate at ingestion time. Instead, handle deduplication **after
running a search** to optimize performance and storage.
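Post-search deduplication can be as simple as a set over the (small) result slice. `dedupeHits` below is a sketch, not part of leak-utils:

```go
package main

import "fmt"

// dedupeHits removes duplicates from a search result set. Running this on a
// small query result is cheap, unlike deduplicating billions of rows at
// ingestion time.
func dedupeHits(hits []string) []string {
	seen := make(map[string]struct{}, len(hits))
	out := make([]string, 0, len(hits))
	for _, h := range hits {
		if _, ok := seen[h]; ok {
			continue
		}
		seen[h] = struct{}{}
		out = append(out, h)
	}
	return out
}

func main() {
	fmt.Println(dedupeHits([]string{"a@x.com:pw1", "a@x.com:pw1", "b@x.com:pw2"}))
	// [a@x.com:pw1 b@x.com:pw2]
}
```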

leak-utils/README.md
# 🛠 leak-utils: The Eleakxir Data Utility Toolkit
`leak-utils` is a command-line tool built to help you manage, process, and
optimize data leaks for use with the **Eleakxir** search engine. It provides a
suite of utilities for data cleaning, format conversion, and file manipulation,
all designed to keep your data files efficient and standardized.
`leak-utils` is written in **Go** and leverages **DuckDB** for its
high-performance in-memory processing, ensuring fast and reliable operations on
large datasets.
## 🚀 Features
- **Parquet File Management**: Clean and inspect existing `.parquet` files.
- **Format Conversion**: Seamlessly convert `.csv`, `.txt`, `.json` files into
the optimized `.parquet` format.
- **Schema Uniformity**: Tools designed to help you standardize and normalize
your data to align with the
[Eleakxir data leak normalization rules](./DATALEAKS-NORMALIZATION.md). This
ensures a consistent schema across all your files, which is crucial for
efficient searching and consistent results.
- **High Performance**: Built with Go and DuckDB for fast and efficient data
processing.
## ⚙️ How to Use
The tool operates via a single executable with different commands, each
corresponding to a specific action. You can find the executable in the
`leak-utils` directory of the Eleakxir project.
### Install
#### With Go
```bash
go install "github.com/anotherhadi/eleakxir/leak-utils@latest"
```
#### With Nix/NixOS
<details>
<summary>Click to expand</summary>
**From anywhere (using the repo URL):**
```bash
nix run "github:anotherhadi/eleakxir#leak-utils" -- action [--flags value]
```
**Permanent Installation:**
```bash
# add the flake to your flake.nix
{
inputs = {
eleakxir.url = "github:anotherhadi/eleakxir";
};
}
# then add it to your packages
environment.systemPackages = with pkgs; [ # or home.packages
eleakxir.packages.${pkgs.system}.leak-utils
];
```
</details>
### Available Actions
#### `cleanParquet`
Optimizes and cleans an existing Parquet file. This can be used to rename or
drop columns, clean rows, and more.
See:
```bash
leak-utils cleanParquet --help
```
#### `infoParquet`
Displays metadata and schema information for a given Parquet file. Useful for
inspecting file structure and column types.
#### `csvToParquet`
Converts a `.csv` file into a highly compressed and efficient `.parquet` file.
This is the recommended way to prepare your data for Eleakxir.
#### `mergeFiles`
Merges multiple files (of the same type) into a single, larger file. This is
useful for combining smaller data leaks.
#### `removeUrlSchemeFromUlp`
This utility prevents the colon (`:`) in URL schemes like `https://` from being
mistakenly parsed as a column separator when processing ULP
(URL:login:password) data in flat files like CSV or TXT.
## 🤝 Contributing
[Contributions](../CONTRIBUTING.md) to `leak-utils` are welcome! Feel free to
open issues or submit pull requests for new features, bug fixes, or performance
improvements.

leak-utils/go.mod
module github.com/anotherhadi/eleakxir/leak-utils
go 1.25.0
require (
github.com/charmbracelet/lipgloss/v2 v2.0.0-beta1
github.com/charmbracelet/log v0.4.2
github.com/marcboeker/go-duckdb v1.8.5
github.com/spf13/pflag v1.0.10
)
require (
github.com/apache/arrow-go/v18 v18.1.0 // indirect
github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect
github.com/charmbracelet/colorprofile v0.3.0 // indirect
github.com/charmbracelet/lipgloss v1.1.0 // indirect
github.com/charmbracelet/x/ansi v0.8.0 // indirect
github.com/charmbracelet/x/cellbuf v0.0.13 // indirect
github.com/charmbracelet/x/term v0.2.1 // indirect
github.com/go-logfmt/logfmt v0.6.0 // indirect
github.com/go-viper/mapstructure/v2 v2.2.1 // indirect
github.com/goccy/go-json v0.10.5 // indirect
github.com/google/flatbuffers v25.1.24+incompatible // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/klauspost/compress v1.17.11 // indirect
github.com/klauspost/cpuid/v2 v2.2.9 // indirect
github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/mattn/go-runewidth v0.0.16 // indirect
github.com/muesli/cancelreader v0.2.2 // indirect
github.com/muesli/termenv v0.16.0 // indirect
github.com/pierrec/lz4/v4 v4.1.22 // indirect
github.com/rivo/uniseg v0.4.7 // indirect
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e // indirect
github.com/zeebo/xxh3 v1.0.2 // indirect
golang.org/x/exp v0.0.0-20250128182459-e0ece0dbea4c // indirect
golang.org/x/mod v0.22.0 // indirect
golang.org/x/sync v0.10.0 // indirect
golang.org/x/sys v0.31.0 // indirect
golang.org/x/tools v0.29.0 // indirect
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da // indirect
)

leak-utils/go.sum
github.com/andybalholm/brotli v1.1.1 h1:PR2pgnyFznKEugtsUo0xLdDop5SKXd5Qf5ysW+7XdTA=
github.com/andybalholm/brotli v1.1.1/go.mod h1:05ib4cKhjx3OQYUY22hTVd34Bc8upXjOLL2rKwwZBoA=
github.com/apache/arrow-go/v18 v18.1.0 h1:agLwJUiVuwXZdwPYVrlITfx7bndULJ/dggbnLFgDp/Y=
github.com/apache/arrow-go/v18 v18.1.0/go.mod h1:tigU/sIgKNXaesf5d7Y95jBBKS5KsxTqYBKXFsvKzo0=
github.com/apache/thrift v0.21.0 h1:tdPmh/ptjE1IJnhbhrcl2++TauVjy242rkV/UzJChnE=
github.com/apache/thrift v0.21.0/go.mod h1:W1H8aR/QRtYNvrPeFXBtobyRkd0/YVhTc6i07XIAgDw=
github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k=
github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8=
github.com/charmbracelet/colorprofile v0.3.0 h1:KtLh9uuu1RCt+Hml4s6Hz+kB1PfV3wi++1h5ia65yKQ=
github.com/charmbracelet/colorprofile v0.3.0/go.mod h1:oHJ340RS2nmG1zRGPmhJKJ/jf4FPNNk0P39/wBPA1G0=
github.com/charmbracelet/lipgloss v1.1.0 h1:vYXsiLHVkK7fp74RkV7b2kq9+zDLoEU4MZoFqR/noCY=
github.com/charmbracelet/lipgloss v1.1.0/go.mod h1:/6Q8FR2o+kj8rz4Dq0zQc3vYf7X+B0binUUBwA0aL30=
github.com/charmbracelet/lipgloss/v2 v2.0.0-beta1 h1:SOylT6+BQzPHEjn15TIzawBPVD0QmhKXbcb3jY0ZIKU=
github.com/charmbracelet/lipgloss/v2 v2.0.0-beta1/go.mod h1:tRlx/Hu0lo/j9viunCN2H+Ze6JrmdjQlXUQvvArgaOc=
github.com/charmbracelet/log v0.4.2 h1:hYt8Qj6a8yLnvR+h7MwsJv/XvmBJXiueUcI3cIxsyig=
github.com/charmbracelet/log v0.4.2/go.mod h1:qifHGX/tc7eluv2R6pWIpyHDDrrb/AG71Pf2ysQu5nw=
github.com/charmbracelet/x/ansi v0.8.0 h1:9GTq3xq9caJW8ZrBTe0LIe2fvfLR/bYXKTx2llXn7xE=
github.com/charmbracelet/x/ansi v0.8.0/go.mod h1:wdYl/ONOLHLIVmQaxbIYEC/cRKOQyjTkowiI4blgS9Q=
github.com/charmbracelet/x/cellbuf v0.0.13 h1:/KBBKHuVRbq1lYx5BzEHBAFBP8VcQzJejZ/IA3iR28k=
github.com/charmbracelet/x/cellbuf v0.0.13/go.mod h1:xe0nKWGd3eJgtqZRaN9RjMtK7xUYchjzPr7q6kcvCCs=
github.com/charmbracelet/x/term v0.2.1 h1:AQeHeLZ1OqSXhrAWpYUtZyX1T3zVxfpZuEQMIQaGIAQ=
github.com/charmbracelet/x/term v0.2.1/go.mod h1:oQ4enTYFV7QN4m0i9mzHrViD7TQKvNEEkHUMCmsxdUg=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/go-logfmt/logfmt v0.6.0 h1:wGYYu3uicYdqXVgoYbvnkrPVXkuLM1p1ifugDMEdRi4=
github.com/go-logfmt/logfmt v0.6.0/go.mod h1:WYhtIu8zTZfxdn5+rREduYbwxfcBr/Vr6KEVveWlfTs=
github.com/go-viper/mapstructure/v2 v2.2.1 h1:ZAaOCxANMuZx5RCeg0mBdEZk7DZasvvZIxtHqx8aGss=
github.com/go-viper/mapstructure/v2 v2.2.1/go.mod h1:oJDH3BJKyqBA2TXFhDsKDGDTlndYOZ6rGS0BRZIxGhM=
github.com/goccy/go-json v0.10.5 h1:Fq85nIqj+gXn/S5ahsiTlK3TmC85qgirsdTP/+DeaC4=
github.com/goccy/go-json v0.10.5/go.mod h1:oq7eo15ShAhp70Anwd5lgX2pLfOS3QCiwU/PULtXL6M=
github.com/golang/snappy v0.0.4 h1:yAGX7huGHXlcLOEtBnF4w7FQwA26wojNCwOYAEhLjQM=
github.com/golang/snappy v0.0.4/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
github.com/google/flatbuffers v25.1.24+incompatible h1:4wPqL3K7GzBd1CwyhSd3usxLKOaJN/AC6puCca6Jm7o=
github.com/google/flatbuffers v25.1.24+incompatible/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/klauspost/asmfmt v1.3.2 h1:4Ri7ox3EwapiOjCki+hw14RyKk201CN4rzyCJRFLpK4=
github.com/klauspost/asmfmt v1.3.2/go.mod h1:AG8TuvYojzulgDAMCnYn50l/5QV3Bs/tp6j0HLHbNSE=
github.com/klauspost/compress v1.17.11 h1:In6xLpyWOi1+C7tXUUWv2ot1QvBjxevKAaI6IXrJmUc=
github.com/klauspost/compress v1.17.11/go.mod h1:pMDklpSncoRMuLFrf1W9Ss9KT+0rH90U12bZKk7uwG0=
github.com/klauspost/cpuid/v2 v2.2.9 h1:66ze0taIn2H33fBvCkXuv9BmCwDfafmiIVpKV9kKGuY=
github.com/klauspost/cpuid/v2 v2.2.9/go.mod h1:rqkxqrZ1EhYM9G+hXH7YdowN5R5RGN6NK4QwQ3WMXF8=
github.com/lucasb-eyer/go-colorful v1.2.0 h1:1nnpGOrhyZZuNyfu1QjKiUICQ74+3FNCN69Aj6K7nkY=
github.com/lucasb-eyer/go-colorful v1.2.0/go.mod h1:R4dSotOR9KMtayYi1e77YzuveK+i7ruzyGqttikkLy0=
github.com/marcboeker/go-duckdb v1.8.5 h1:tkYp+TANippy0DaIOP5OEfBEwbUINqiFqgwMQ44jME0=
github.com/marcboeker/go-duckdb v1.8.5/go.mod h1:6mK7+WQE4P4u5AFLvVBmhFxY5fvhymFptghgJX6B+/8=
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
github.com/mattn/go-runewidth v0.0.16 h1:E5ScNMtiwvlvB5paMFdw9p4kSQzbXFikJ5SQO6TULQc=
github.com/mattn/go-runewidth v0.0.16/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 h1:AMFGa4R4MiIpspGNG7Z948v4n35fFGB3RR3G/ry4FWs=
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8/go.mod h1:mC1jAcsrzbxHt8iiaC+zU4b1ylILSosueou12R++wfY=
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 h1:+n/aFZefKZp7spd8DFdX7uMikMLXX4oubIzJF4kv/wI=
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3/go.mod h1:RagcQ7I8IeTMnF8JTXieKnO4Z6JCsikNEzj0DwauVzE=
github.com/muesli/cancelreader v0.2.2 h1:3I4Kt4BQjOR54NavqnDogx/MIoWBFa0StPA8ELUXHmA=
github.com/muesli/cancelreader v0.2.2/go.mod h1:3XuTXfFS2VjM+HTLZY9Ak0l6eUKfijIfMUZ4EgX0QYo=
github.com/muesli/termenv v0.16.0 h1:S5AlUN9dENB57rsbnkPyfdGuWIlkmzJjbFf0Tf5FWUc=
github.com/muesli/termenv v0.16.0/go.mod h1:ZRfOIKPFDYQoDFF4Olj7/QJbW60Ol/kL1pU3VfY/Cnk=
github.com/pierrec/lz4/v4 v4.1.22 h1:cKFw6uJDK+/gfw5BcDL0JL5aBsAFdsIT18eRtLj7VIU=
github.com/pierrec/lz4/v4 v4.1.22/go.mod h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
github.com/rivo/uniseg v0.4.7 h1:WUdvkW8uEhrYfLC4ZzdpI2ztxP1I582+49Oc5Mq64VQ=
github.com/rivo/uniseg v0.4.7/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=
github.com/spf13/pflag v1.0.10 h1:4EBh2KAYBwaONj6b2Ye1GiHfwjqyROoF4RwYO+vPwFk=
github.com/spf13/pflag v1.0.10/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
github.com/stretchr/testify v1.10.0 h1:Xv5erBjTwe/5IxqUQTdXv5kgmIvbHo3QQyRwhJsOfJA=
github.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e h1:JVG44RsyaB9T2KIHavMF/ppJZNG9ZpyihvCd0w101no=
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e/go.mod h1:RbqR21r5mrJuqunuUZ/Dhy/avygyECGrLceyNeo4LiM=
github.com/zeebo/assert v1.3.0 h1:g7C04CbJuIDKNPFHmsk4hwZDO5O+kntRxzaUoNXj+IQ=
github.com/zeebo/assert v1.3.0/go.mod h1:Pq9JiuJQpG8JLJdtkwrJESF0Foym2/D9XMU5ciN/wJ0=
github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
golang.org/x/exp v0.0.0-20250128182459-e0ece0dbea4c h1:KL/ZBHXgKGVmuZBZ01Lt57yE5ws8ZPSkkihmEyq7FXc=
golang.org/x/exp v0.0.0-20250128182459-e0ece0dbea4c/go.mod h1:tujkw807nyEEAamNbDrEGzRav+ilXA7PCRAd6xsmwiU=
golang.org/x/mod v0.22.0 h1:D4nJWe9zXqHOmWqj4VMOJhvzj7bEZg4wEYa759z1pH4=
golang.org/x/mod v0.22.0/go.mod h1:6SkKJ3Xj0I0BrPOZoBy3bdMptDDU9oJrpohJ3eWZ1fY=
golang.org/x/sync v0.10.0 h1:3NQrjDixjgGwUOCaF8w2+VYHv0Ve/vGYSbdkTa98gmQ=
golang.org/x/sync v0.10.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.31.0 h1:ioabZlmFYtWhL+TRYpcnNlLwhyxaM9kWTDEmfnprqik=
golang.org/x/sys v0.31.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k=
golang.org/x/tools v0.29.0 h1:Xx0h3TtM9rzQpQuR4dKLrdglAmCEN5Oi+P74JdhdzXE=
golang.org/x/tools v0.29.0/go.mod h1:KMQVMRsVxU6nHCFXrBPhDB8XncLNLM0lIy/F14RP588=
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da h1:noIWHXmPHxILtqtCOPIhSt0ABwskkZKjD3bXGnZGpNY=
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da/go.mod h1:NDW/Ps6MPRej6fsCIbMTohpP40sJ/P/vI1MoTEGwX90=
gonum.org/v1/gonum v0.15.1 h1:FNy7N6OUZVUaWG9pTiD+jlhdQ3lMP+/LcTpJ6+a8sQ0=
gonum.org/v1/gonum v0.15.1/go.mod h1:eZTZuRFrzu5pcyjN5wJhcIhnUdNijYxX1T2IcrOGY0o=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

package main
import (
"database/sql"
"fmt"
"os"
"slices"
"strings"
"github.com/anotherhadi/eleakxir/leak-utils/misc"
"github.com/anotherhadi/eleakxir/leak-utils/parquet"
"github.com/anotherhadi/eleakxir/leak-utils/settings"
"github.com/charmbracelet/log"
_ "github.com/marcboeker/go-duckdb"
flag "github.com/spf13/pflag"
)
func main() {
db, err := sql.Open("duckdb", "")
if err != nil {
log.Fatal("Failed to open DuckDB", "error", err)
}
defer db.Close()
lu := settings.LeakUtils{
Db: db,
}
actions := []string{
"cleanParquet",
"infoParquet",
// Csv
"csvToParquet",
// Misc
"mergeFiles",
"removeUrlSchemeFromUlp",
}
if len(os.Args) < 2 {
fmt.Println(settings.Muted.Render("Usage: "), settings.Accent.Render(os.Args[0], "<action>"))
fmt.Println(settings.Muted.Render("Actions: "), settings.Base.Render(strings.Join(actions, ", ")))
return
}
action := os.Args[1]
if !slices.Contains(actions, action) {
log.Fatal("Unknown action", "action", action)
}
switch action {
case "cleanParquet":
var inputFile *string = flag.StringP("input", "i", "", "Input Parquet file")
var outputFile *string = flag.StringP("output", "o", "", "Output Parquet file")
var compression *string = flag.StringP("compression", "c", "ZSTD", "Compression codec (UNCOMPRESSED, SNAPPY, GZIP, BROTLI, LZ4, ZSTD)")
var skipLineFormating *bool = flag.BoolP("skip-line-formating", "s", false, "Skip line formating")
var deleteFirstRow *bool = flag.Bool("delete-first-row", false, "Delete first row")
var debug *bool = flag.Bool("debug", false, "Debug mode")
var noColors *bool = flag.Bool("no-colors", false, "Remove all colors")
var printQuery *bool = flag.BoolP("print-query", "p", false, "Print the query instead of executing it")
flag.Parse()
if *inputFile == "" || *outputFile == "" {
log.Fatal("Input and output files are required")
}
if *noColors {
settings.DisableColors()
}
lu.Compression = *compression
lu.Debug = *debug
err := parquet.CleanParquet(lu, *inputFile, *outputFile, *skipLineFormating, *deleteFirstRow, *printQuery)
if err != nil {
log.Fatal("Failed to clean Parquet file", "error", err)
}
return
case "infoParquet":
var inputFile *string = flag.StringP("input", "i", "", "Input Parquet file")
var debug *bool = flag.Bool("debug", false, "Debug mode")
var noColors *bool = flag.Bool("no-colors", false, "Remove all colors")
flag.Parse()
if *inputFile == "" {
log.Fatal("Input file is required")
}
if *noColors {
settings.DisableColors()
}
lu.Debug = *debug
err := parquet.InfoParquet(lu, *inputFile)
if err != nil {
log.Fatal("Failed to read Parquet file", "error", err)
}
return
case "csvToParquet":
var inputFile *string = flag.StringP("input", "i", "", "Input CSV file")
var outputFile *string = flag.StringP("output", "o", "", "Output Parquet file")
var strict *bool = flag.Bool("strict", true, "Strict mode for Duckdb")
var compression *string = flag.StringP("compression", "c", "ZSTD", "Compression codec (UNCOMPRESSED, SNAPPY, GZIP, BROTLI, LZ4, ZSTD)")
var noColors *bool = flag.Bool("no-colors", false, "Remove all colors")
var debug *bool = flag.Bool("debug", false, "Debug mode")
flag.Parse()
if *inputFile == "" || *outputFile == "" {
log.Fatal("Input and output files are required")
}
if *noColors {
settings.DisableColors()
}
lu.Compression = *compression
lu.Debug = *debug
err := misc.CsvToParquet(lu, *inputFile, *outputFile, *strict)
if err != nil {
log.Fatal("Failed to transform Csv file", "error", err)
}
return
case "mergeFiles":
var inputFiles *[]string = flag.StringArrayP("inputs", "i", []string{}, "Input Parquet files")
var outputFile *string = flag.StringP("output", "o", "", "Output Parquet file")
var noColors *bool = flag.Bool("no-colors", false, "Remove all colors")
var debug *bool = flag.Bool("debug", false, "Debug mode")
flag.Parse()
if len(*inputFiles) == 0 || *outputFile == "" {
log.Fatal("Input files and an output file are required")
}
if *noColors {
settings.DisableColors()
}
lu.Debug = *debug
err := misc.MergeFiles(lu, *outputFile, *inputFiles...)
if err != nil {
log.Fatal("Failed to merge files", "error", err)
}
return
case "removeUrlSchemeFromUlp":
var inputFile *string = flag.StringP("input", "i", "", "Input ULP file (CSV/TXT)")
var noColors *bool = flag.Bool("no-colors", false, "Remove all colors")
var debug *bool = flag.Bool("debug", false, "Debug mode")
flag.Parse()
if *inputFile == "" {
log.Fatal("Input file is required")
}
if *noColors {
settings.DisableColors()
}
lu.Debug = *debug
err := misc.RemoveUrlSchemeFromUlp(lu, *inputFile)
if err != nil {
log.Fatal("Failed to remove ULP Url schemes", "error", err)
}
return
}
}

leak-utils/misc/csv.go
package misc
import (
"bufio"
"encoding/csv"
"fmt"
"io"
"os"
"slices"
"strings"
"github.com/anotherhadi/eleakxir/leak-utils/settings"
"github.com/charmbracelet/log"
)
func CsvToParquet(lu settings.LeakUtils, inputFile string, outputFile string, strict bool) error {
hasHeader, err := csvHasHeader(inputFile)
if err != nil {
return err
}
header := "true"
if !hasHeader {
header = "false"
}
strictMode := "true"
if !strict {
strictMode = "false"
}
delimiter := getDelimiter(inputFile)
query := fmt.Sprintf(`CREATE TABLE my_table AS FROM read_csv_auto('%s', HEADER=%s, delim='%s', ignore_errors=true, all_varchar=true, null_padding=true, strict_mode=%s);
COPY my_table TO '%s' (FORMAT 'parquet', COMPRESSION '%s', ROW_GROUP_SIZE 200_000);`,
inputFile, header, delimiter, strictMode, outputFile, lu.Compression)
if lu.Debug {
log.Info("Detected delimiter", "delimiter", delimiter)
log.Info("CSV header detection", "hasHeader", hasHeader)
log.Info("Executing query", "query", query)
}
_, err = lu.Db.Exec(query)
if lu.Debug {
log.Info("Finished executing query")
}
return err
}
func getDelimiter(inputFile string) string {
lines, err := getNLine(inputFile, 10, 0)
if err != nil {
log.Warn("Failed to read CSV file to determine delimiter, defaulting to comma", "error", err)
return ","
}
delimiterCounts := map[string]int{
",": 0,
";": 0,
"\t": 0,
"|": 0,
":": 0,
}
for _, line := range lines {
for d := range delimiterCounts {
delimiterCounts[d] += strings.Count(line, d)
}
}
maxCount := 0
delimiter := ","
for d, count := range delimiterCounts {
if count > maxCount {
maxCount = count
delimiter = d
}
}
return delimiter
}
func csvHasHeader(inputFile string) (hasHeader bool, err error) {
firstRow, err := getFirstRowCsv(inputFile)
if err != nil {
return false, err
}
for i, col := range firstRow {
col = strings.ReplaceAll(col, "\"", "")
col = strings.ReplaceAll(col, " ", "")
col = strings.ReplaceAll(col, "-", "")
col = strings.ReplaceAll(col, "_", "")
col = strings.ReplaceAll(col, ".", "")
firstRow[i] = strings.ToLower(strings.TrimSpace(col))
}
knownHeaders := []string{"email", "password", "username", "phone", "lastname", "firstname"}
for _, knownHeader := range knownHeaders {
if slices.Contains(firstRow, knownHeader) {
return true, nil
}
}
return false, nil
}
func getNLine(inputFile string, n, offset int) (lines []string, err error) {
if n <= 0 {
return nil, nil
}
if offset < 0 {
offset = 0
}
file, err := os.Open(inputFile)
if err != nil {
return nil, err
}
defer file.Close()
scanner := bufio.NewScanner(file)
currentLine := 0
for scanner.Scan() {
currentLine++
if currentLine <= offset {
continue
}
lines = append(lines, scanner.Text())
if len(lines) >= n {
break
}
}
if err := scanner.Err(); err != nil && err != io.EOF {
return nil, err
}
return lines, nil
}
func getFirstRowCsv(inputFile string) (row []string, err error) {
rows, err := getFirstNRowsCsv(inputFile, 1)
if err != nil {
return nil, err
}
if len(rows) == 0 {
return nil, fmt.Errorf("no rows found in CSV")
}
return rows[0], nil
}
func getFirstNRowsCsv(inputFile string, n int) (rows [][]string, err error) {
f, err := os.Open(inputFile)
if err != nil {
return nil, fmt.Errorf("failed to open file: %w", err)
}
defer f.Close()
reader := csv.NewReader(f)
for i := 0; i < n; i++ {
row, err := reader.Read()
if err != nil {
if err == io.EOF {
break
}
return nil, fmt.Errorf("failed to read CSV: %w", err)
}
rows = append(rows, row)
}
return rows, nil
}

leak-utils/misc/misc.go
package misc
import (
"io"
"os"
"github.com/anotherhadi/eleakxir/leak-utils/settings"
)
func MergeFiles(lu settings.LeakUtils, outputFile string, inputFiles ...string) error {
out, err := os.Create(outputFile)
if err != nil {
return err
}
defer out.Close()
for _, inputFile := range inputFiles {
file, err := os.Open(inputFile)
if err != nil {
return err
}
// Close each input as soon as it is copied, rather than deferring,
// so file descriptors are not held open for the whole loop.
_, err = io.Copy(out, file)
file.Close()
if err != nil {
return err
}
}
return nil
}

leak-utils/misc/ulp.go
package misc
import (
"bufio"
"io"
"os"
"strings"
"github.com/anotherhadi/eleakxir/leak-utils/settings"
)
func RemoveUrlSchemeFromUlp(lu settings.LeakUtils, inputFile string) error {
file, err := os.Open(inputFile)
if err != nil {
return err
}
defer file.Close()
outputFile := inputFile + ".clean"
out, err := os.Create(outputFile)
if err != nil {
return err
}
defer out.Close()
reader := bufio.NewReader(file)
writer := bufio.NewWriter(out)
for {
line, err := reader.ReadString('\n')
if err != nil && err != io.EOF {
return err
}
firstColon := strings.Index(line, ":")
firstScheme := strings.Index(line, "://")
if firstScheme != -1 && firstColon == firstScheme {
line = line[firstScheme+3:]
}
_, werr := writer.WriteString(line)
if werr != nil {
return werr
}
if err == io.EOF {
break
}
}
err = writer.Flush()
if err != nil {
return err
}
err = os.Remove(inputFile)
if err != nil {
return err
}
err = os.Rename(outputFile, inputFile)
if err != nil {
return err
}
return nil
}

package parquet
import (
"fmt"
"strings"
"github.com/anotherhadi/eleakxir/leak-utils/settings"
)
// If there is no full_name but both first_name and last_name exist, create full_name from them.
// Failing that, rename name (or a lone first_name/last_name) to full_name.
func addFullname(operations []ColumnOperation) []ColumnOperation {
hasFullName := false
hasFirstName := false
hasLastName := false
hasName := false
for _, op := range operations {
if op.Action != "drop" {
if op.NewName == "full_name" {
hasFullName = true
} else if op.NewName == "first_name" {
hasFirstName = true
} else if op.NewName == "last_name" {
hasLastName = true
} else if op.NewName == "name" {
hasName = true
}
}
}
if hasFullName {
return operations
}
if hasFirstName && hasLastName {
operations = append(operations, ColumnOperation{
OriginalName: "first_name || ' ' || last_name",
NewName: "full_name",
Action: "rename",
})
fmt.Println(settings.Muted.Render("\nAdding new column 'full_name' as concatenation of 'first_name' and 'last_name'."))
return operations
}
if hasName {
for i, op := range operations {
if op.NewName == "name" && op.Action != "drop" {
operations[i].NewName = "full_name"
fmt.Println(settings.Muted.Render("\nRenaming column 'name' to 'full_name'."))
return operations
}
}
}
if hasFirstName {
operations = append(operations, ColumnOperation{
OriginalName: "first_name",
NewName: "full_name",
Action: "rename",
})
fmt.Println(settings.Muted.Render("\nAdding new column 'full_name' from 'first_name'."))
return operations
}
if hasLastName {
operations = append(operations, ColumnOperation{
OriginalName: "last_name",
NewName: "full_name",
Action: "rename",
})
fmt.Println(settings.Muted.Render("\nAdding new column 'full_name' from 'last_name'."))
return operations
}
return operations
}
// formatColumnName formats a column name to be SQL-compliant.
func formatColumnName(columnName string) string {
columnName = strings.TrimSpace(columnName)
columnName = strings.ToLower(columnName)
columnName = strings.Join(strings.Fields(columnName), "_")
columnName = strings.ReplaceAll(columnName, "\"", "")
columnName = strings.ReplaceAll(columnName, "'", "")
columnName = strings.ReplaceAll(columnName, " ", "_")
columnName = strings.ReplaceAll(columnName, "-", "_")
// Only keep a-z, 0-9 and _
var formatted strings.Builder
for _, r := range columnName {
if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '_' {
formatted.WriteRune(r)
}
}
columnName = formatted.String()
columnName = strings.TrimPrefix(columnName, "_")
columnName = strings.TrimSuffix(columnName, "_")
return columnName
}
// formatColumns applies specific formatting rules to column operations.
func formatColumns(operations []ColumnOperation) []ColumnOperation {
formattedOperations := []ColumnOperation{}
for _, op := range operations {
if op.NewName == "phone" || strings.HasSuffix(op.NewName, "_phone") {
op.OriginalName = "REGEXP_REPLACE(" + op.OriginalName + ", '[^0-9]', '')"
} else if op.NewName == "email" || strings.HasSuffix(op.NewName, "_email") {
op.OriginalName = "REGEXP_REPLACE(LOWER(TRIM(" + op.OriginalName + ")), '[^a-z0-9._@-]', '')"
}
formattedOperations = append(formattedOperations, op)
}
return formattedOperations
}

package parquet
import (
"bufio"
"database/sql"
"fmt"
"os"
"strings"
"github.com/anotherhadi/eleakxir/leak-utils/settings"
"github.com/charmbracelet/log"
)
type Parquet struct {
Filepath string
Filename string
Columns []string
Sample [][]string
NRows int64
Compression string // Compression of the output file (e.g., "SNAPPY", "ZSTD", "NONE" or "")
}
type ColumnOperation struct {
OriginalName string
NewName string
Action string // "keep", "rename", "drop"
}
func (parquet Parquet) PrintParquet() {
fmt.Println(settings.Header.Render(parquet.Filename) + "\n")
fmt.Println(settings.Accent.Render("File path:"), settings.Base.Render(parquet.Filepath))
fmt.Println(settings.Accent.Render("Number of columns:"), settings.Base.Render(fmt.Sprintf("%d", len(parquet.Columns))))
fmt.Println(settings.Accent.Render("Number of rows:"), settings.Base.Render(formatWithSpaces(parquet.NRows)))
fmt.Println()
fmt.Println(settings.Accent.Render(strings.Join(parquet.Columns, " | ")))
for _, row := range parquet.Sample {
fmt.Println(settings.Base.Render(strings.Join(row, " | ")))
}
}
func InfoParquet(lu settings.LeakUtils, inputFile string) error {
parquet, err := GetParquet(lu.Db, inputFile)
if err != nil {
return err
}
parquet.PrintParquet()
return nil
}
func CleanParquet(lu settings.LeakUtils, inputFile, outputFile string, skipLineFormating, deleteFirstRow, printQuery bool) error {
input, err := GetParquet(lu.Db, inputFile)
if err != nil {
return err
}
input.PrintParquet()
columnOps := configureColumns(*input, skipLineFormating)
output := Parquet{
Filepath: outputFile,
Compression: lu.Compression,
}
err = transformParquet(lu, *input, output, columnOps, deleteFirstRow, printQuery)
return err
}
func configureColumns(input Parquet, skipLineFormating bool) []ColumnOperation {
reader := bufio.NewReader(os.Stdin)
var operations []ColumnOperation
fmt.Println()
fmt.Println(settings.Base.Render("For each column, choose an action:"))
fmt.Println(settings.Base.Render(" [k] Keep"))
fmt.Println(settings.Base.Render(" [r] Rename"))
fmt.Println(settings.Base.Render(" [d] Drop/Delete"))
fmt.Println(settings.Base.Render(" [s] Suggested"))
fmt.Println(settings.Base.Render(" [b] Go back"))
fmt.Println()
columns:
	for i := 0; i < len(input.Columns); i++ {
		col := input.Columns[i]
		suggestion := getSuggestion(col)
		for {
			fmt.Println(settings.Muted.Render("\nColumn:"), settings.Accent.Render(col))
			if suggestion != "" {
				fmt.Println(settings.Alert.Render("Suggested action: Rename to '" + suggestion + "'"))
			}
			fmt.Print(settings.Base.Render("[k/r/d/s/b]: "))
			choice, err := reader.ReadString('\n') // "choice" avoids shadowing the input Parquet parameter
			if err != nil {
				log.Printf("Error reading input: %v", err)
				continue
			}
			choice = strings.TrimSpace(strings.ToLower(choice))
			op := ColumnOperation{
				OriginalName: col,
				NewName:      col,
				Action:       "keep",
			}
			switch choice {
			case "b", "back":
				if i > 0 {
					i -= 2 // the outer loop's i++ lands us on the previous column
					if len(operations) > 0 {
						operations = operations[:len(operations)-1]
					}
					fmt.Println(settings.Muted.Render("Going back to the previous column..."))
					// No operation was recorded for this column, so skip the
					// summary below instead of reporting a stale one.
					continue columns
				}
				fmt.Println(settings.Muted.Render("Already at the first column, cannot go back further."))
				continue
case "r", "rename":
fmt.Print(settings.Base.Render("Enter new name: "))
newName, err := reader.ReadString('\n')
if err != nil {
log.Printf("Error reading new name: %v", err)
continue
}
newName = strings.TrimSpace(newName)
if newName != "" {
op.OriginalName = "\"" + op.OriginalName + "\""
op.NewName = formatColumnName(newName)
op.Action = "rename"
operations = append(operations, op)
goto nextColumn
} else {
fmt.Println(settings.Muted.Render("Invalid name, please try again."))
continue
}
case "s", "suggested":
if suggestion != "" {
op.OriginalName = "\"" + op.OriginalName + "\""
op.NewName = formatColumnName(suggestion)
op.Action = "rename"
} else {
fmt.Println(settings.Muted.Render("No valid suggestion available"))
continue
}
operations = append(operations, op)
goto nextColumn
case "d", "drop", "delete":
op.Action = "drop"
operations = append(operations, op)
goto nextColumn
case "k", "keep", "":
op.OriginalName = "\"" + op.OriginalName + "\""
op.NewName = formatColumnName(op.NewName)
op.Action = "rename"
operations = append(operations, op)
goto nextColumn
default:
fmt.Println(settings.Muted.Render("Invalid choice, please enter [k/r/d/s/b]."))
continue
}
}
	nextColumn:
		if len(operations) == 0 {
			// Nothing recorded yet (e.g. after going back from the second
			// column): there is no summary to print, and indexing would panic.
			continue
		}
		lastOp := operations[len(operations)-1]
switch lastOp.Action {
case "rename":
if formatColumnName(lastOp.OriginalName) == lastOp.NewName {
fmt.Printf(settings.Muted.Render("Keeping column '%s' as is.\n"), lastOp.OriginalName)
} else {
fmt.Printf(settings.Muted.Render("Renaming column '%s' to '%s'.\n"), lastOp.OriginalName, lastOp.NewName)
}
case "drop":
fmt.Printf(settings.Muted.Render("Dropping column '%s'.\n"), lastOp.OriginalName)
}
}
if !skipLineFormating {
operations = formatColumns(operations)
}
operations = addFullname(operations)
return operations
}
func transformParquet(lu settings.LeakUtils, input, output Parquet, operations []ColumnOperation, deleteFirstRow, printQuery bool) error {
var selectClauses []string
hasColumns := false
for _, op := range operations {
if op.Action != "drop" {
hasColumns = true
if op.Action == "rename" {
selectClauses = append(selectClauses, fmt.Sprintf("%s AS \"%s\"", op.OriginalName, op.NewName))
} else {
selectClauses = append(selectClauses, op.OriginalName)
}
}
}
if !hasColumns {
return fmt.Errorf("no columns selected for output")
}
selectClause := strings.Join(selectClauses, ", ")
compression := ""
if output.Compression != "" {
compression = ", COMPRESSION '" + output.Compression + "'"
}
	columnsLength := []string{}
	for _, col := range input.Columns {
		columnsLength = append(columnsLength, "COALESCE(LENGTH(\""+col+"\"),0)")
	}
	// Heuristic: skip rows whose combined field length reaches 30 characters per
	// column on average, which tend to be malformed or merged lines.
	allowedRowSize := 30 * len(input.Columns)
offset := ""
if deleteFirstRow {
offset = "OFFSET 1"
}
	// Paths are embedded in single-quoted SQL literals, so double any single
	// quotes they contain to keep the query valid.
	inPath := strings.ReplaceAll(input.Filepath, "'", "''")
	outPath := strings.ReplaceAll(output.Filepath, "'", "''")
	query := fmt.Sprintf(`
	COPY (
		SELECT %s
		FROM read_parquet('%s')
		WHERE (%s) < %d
		%s
	) TO '%s' (FORMAT PARQUET, ROW_GROUP_SIZE 200_000 %s)
	`, selectClause, inPath, strings.Join(columnsLength, "+"), allowedRowSize, offset, outPath, compression)
	if printQuery {
		// Strip the indentation tabs from the raw query string before printing.
		fmt.Println("Query:", strings.ReplaceAll(query, "\t", ""))
		return nil
	}
fmt.Println(settings.Base.Render("\nTransforming and writing to output parquet..."))
_, err := lu.Db.Exec(query)
if err != nil {
return fmt.Errorf("failed to execute transformation: %w", err)
}
fmt.Println(settings.Base.Render("Transformation complete!\n"))
newParquet, err := GetParquet(lu.Db, output.Filepath)
if err != nil {
return err
}
newParquet.PrintParquet()
return nil
}
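The column-operation handling at the top of transformParquet can be sketched as a standalone helper. buildSelect is a hypothetical name used here for illustration only; the logic (dropped columns omitted, renamed columns aliased, kept columns passed through) matches the loop above:

```go
package main

import (
	"fmt"
	"strings"
)

// ColumnOperation mirrors the struct used by leak-utils.
type ColumnOperation struct {
	OriginalName string
	NewName      string
	Action       string // "keep", "rename", "drop"
}

// buildSelect turns a list of operations into a SQL SELECT clause.
func buildSelect(ops []ColumnOperation) string {
	var clauses []string
	for _, op := range ops {
		switch op.Action {
		case "drop":
			// Dropped columns are simply omitted from the output.
		case "rename":
			clauses = append(clauses, fmt.Sprintf("%s AS \"%s\"", op.OriginalName, op.NewName))
		default:
			clauses = append(clauses, op.OriginalName)
		}
	}
	return strings.Join(clauses, ", ")
}

func main() {
	ops := []ColumnOperation{
		{OriginalName: `"Mail"`, NewName: "email", Action: "rename"},
		{OriginalName: `"junk"`, Action: "drop"},
		{OriginalName: `"city"`, NewName: "city", Action: "keep"},
	}
	fmt.Println(buildSelect(ops)) // "Mail" AS "email", "city"
}
```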
func GetParquet(db *sql.DB, inputFile string) (parquet *Parquet, err error) {
parquet = &Parquet{}
parquet.Filepath = inputFile
parquet.Columns, err = getColumns(db, inputFile)
if err != nil {
return
}
parquet.NRows, err = countRows(db, inputFile)
if err != nil {
return
}
parquet.Sample, err = getFirstNRows(db, inputFile, 6)
if err != nil {
return
}
n := strings.LastIndex(inputFile, "/")
if n == -1 {
parquet.Filename = inputFile
} else {
parquet.Filename = inputFile[n+1:]
}
return
}


@@ -0,0 +1,81 @@
package parquet
import (
"slices"
)
func getSuggestion(col string) string {
col = formatColumnName(col)
knownNames := []string{
"date",
"phone",
"username",
"address",
"email",
"postal_code",
"city",
"country",
"state",
"age",
"gender",
"password",
"password_hash",
"full_name",
"last_name",
"name", // Will be renamed to full_name later
"first_name",
"birth_date",
"url",
"ip",
}
	if slices.Contains(knownNames, col) {
		return col
	}
	// Map common alternative spellings to their canonical names.
	switch col {
	case "user", "login":
		return "username"
	case "sex":
		return "gender"
	case "ip_address":
		return "ip"
	case "password_hashed", "hash", "hashed_password", "hash_password":
		return "password_hash"
	case "firstname":
		return "first_name"
	case "lastname":
		return "last_name"
	case "fullname":
		return "full_name"
	case "mail":
		return "email"
	case "zip", "postalcode", "zipcode", "postal", "zip_code":
		return "postal_code"
	case "street_address":
		return "address"
	case "birthdate", "dob", "date_of_birth":
		return "birth_date"
	}
	return ""
}
// HINTS:
// date: _date
// url: _url, link
// address: _address
//

leak-utils/parquet/utils.go

@@ -0,0 +1,105 @@
package parquet
import (
"database/sql"
"fmt"
"strconv"
"strings"
)
// getColumns retrieves the column names from the Parquet file.
func getColumns(db *sql.DB, filepath string) ([]string, error) {
// Create a view from the parquet file
query := fmt.Sprintf("CREATE OR REPLACE VIEW parquet_view AS SELECT * FROM read_parquet('%s')", filepath)
_, err := db.Exec(query)
if err != nil {
return nil, fmt.Errorf("failed to create view: %w", err)
}
// Get column information
rows, err := db.Query("DESCRIBE parquet_view")
if err != nil {
return nil, fmt.Errorf("failed to describe view: %w", err)
}
defer rows.Close()
var columns []string
for rows.Next() {
var colName, colType, nullable, key, defaultVal, extra sql.NullString
err := rows.Scan(&colName, &colType, &nullable, &key, &defaultVal, &extra)
if err != nil {
return nil, fmt.Errorf("failed to scan row: %w", err)
}
if colName.Valid {
columns = append(columns, colName.String)
}
}
return columns, nil
}
// getFirstNRows retrieves the first N rows from the Parquet file.
func getFirstNRows(db *sql.DB, inputFile string, n int) ([][]string, error) {
query := fmt.Sprintf("SELECT * FROM read_parquet('%s') LIMIT %d", inputFile, n)
rows, err := db.Query(query)
if err != nil {
return nil, fmt.Errorf("failed to query parquet file: %w", err)
}
defer rows.Close()
cols, err := rows.Columns()
if err != nil {
return nil, fmt.Errorf("failed to get columns: %w", err)
}
var results [][]string
for rows.Next() {
values := make([]sql.NullString, len(cols))
valuePtrs := make([]any, len(cols))
for i := range values {
valuePtrs[i] = &values[i]
}
err := rows.Scan(valuePtrs...)
if err != nil {
return nil, fmt.Errorf("failed to scan row: %w", err)
}
var row []string
for _, val := range values {
if val.Valid {
row = append(row, val.String)
} else {
row = append(row, "NULL")
}
}
results = append(results, row)
}
return results, nil
}
// countRows counts the number of rows in the Parquet file.
func countRows(db *sql.DB, inputFile string) (int64, error) {
var count int64
err := db.QueryRow(fmt.Sprintf("SELECT COUNT(*) FROM read_parquet('%s')", inputFile)).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count rows: %w", err)
}
return count, nil
}
// formatWithSpaces formats an integer with spaces as thousand separators.
func formatWithSpaces(n int64) string {
s := strconv.FormatInt(n, 10)
var b strings.Builder
l := len(s)
for i, c := range s {
if i != 0 && (l-i)%3 == 0 {
b.WriteRune(' ')
}
b.WriteRune(c)
}
return b.String()
}
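A quick standalone check of the thousands-separator logic (the function body is copied verbatim from above; row counts are non-negative, so the leading-minus case is out of scope here):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatWithSpaces inserts a space every three digits, counting from the right.
// Note: a negative n would get a space after the '-' sign; callers only pass
// non-negative row counts, so that case is not handled.
func formatWithSpaces(n int64) string {
	s := strconv.FormatInt(n, 10)
	var b strings.Builder
	l := len(s)
	for i, c := range s {
		if i != 0 && (l-i)%3 == 0 {
			b.WriteRune(' ')
		}
		b.WriteRune(c)
	}
	return b.String()
}

func main() {
	fmt.Println(formatWithSpaces(6241))    // 6 241
	fmt.Println(formatWithSpaces(1234567)) // 1 234 567
}
```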


@@ -0,0 +1,27 @@
package settings
import (
"github.com/charmbracelet/lipgloss/v2"
)
var (
purple = lipgloss.Color("99")
lightPurple = lipgloss.Color("98")
yellow = lipgloss.Color("220")
gray = lipgloss.Color("245")
lightGray = lipgloss.Color("241")
Header = lipgloss.NewStyle().Foreground(purple).Bold(true)
Accent = lipgloss.NewStyle().Foreground(lightPurple)
Base = lipgloss.NewStyle().Foreground(lightGray)
Alert = lipgloss.NewStyle().Foreground(yellow).Bold(true)
Muted = lipgloss.NewStyle().Foreground(gray)
)
func DisableColors() {
Header = lipgloss.NewStyle()
Accent = lipgloss.NewStyle()
Base = lipgloss.NewStyle()
Alert = lipgloss.NewStyle()
Muted = lipgloss.NewStyle()
}


@@ -0,0 +1,9 @@
package settings
import "database/sql"
type LeakUtils struct {
Debug bool
Compression string // Compression of the output file (e.g., "SNAPPY", "ZSTD", "NONE" or "")
Db *sql.DB
}