BOORU CHARS OPEN DATASET is an attempt to consolidate and arrange available character-centric,
almost-SFW anime/CG/game art in a local form suited to both batch processing and visual evaluation.
**This release of BOORU CHARS consists of:**
- **1.593.429 sample images**, resized (mogrified) to 1280px (1024px for 1x1)
* grouped into 18 volumes (directories) by aspect ratio and year
* zipped in archives of 1000 images each, grouped by statistical similarity
* with verbose file naming **%website% - %id% - %copyright% ~ %characters% (%artist%)**
* with some tags placed into EXIF
- several tab-separated text files with metadata
* post/image info (for samples and originals, plus imageboard data) - 1.593.429 rows
* collected tag info with some addons - 35.222.997 rows
* listing for 32 torrents total with 3.839.005 pictures (almost 5 TB - the basis of the dataset)
- an example of dataset usage for body object detection and character "assembling"
* detector [notAI-tech NudeNet](https://github.com/notAI-tech/NudeNet) results for 2 volumes and composed output listings
* ~4000 most interesting visualisations (samples with detections drawn)
- some verbose descriptions
* **readme_RU/EN** with code examples and a lot of references
* several zipped Excel workbooks with analytic results and SQL queries
* some illustrative screenshots
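
The verbose file naming pattern listed above can be split back into its fields. The following is a minimal sketch, not the author's tooling: the function name `parse_sample_name`, the list of image extensions, the assumption that website and id contain no spaces, and the example filename are all hypothetical illustrations.

```python
import re
from typing import Optional

# Assumed layout: "%website% - %id% - %copyright% ~ %characters% (%artist%)",
# optionally followed by an image extension. The extension list is a guess.
NAME_RE = re.compile(
    r"(?P<website>\S+) - (?P<id>\S+) - (?P<copyright>.*?) ~ "
    r"(?P<characters>.*) \((?P<artist>[^()]*)\)"
    r"(?:\.(?:jpe?g|png|gif|webp))?",
    re.IGNORECASE,
)

def parse_sample_name(filename: str) -> Optional[dict]:
    """Return the five name fields as a dict, or None if the name does not fit."""
    match = NAME_RE.fullmatch(filename)
    return match.groupdict() if match else None

# Hypothetical filename, made up for illustration only.
info = parse_sample_name(
    "safebooru.org - 1234567 - touhou ~ hakurei reimu (some_artist).jpg"
)
if info is not None:
    # website + id together serve as the image identifier
    print(info["website"], info["id"], info["characters"])
```

The characters group is greedy, so a character tag that itself contains parentheses stays intact and only the last parenthesised part is read as the artist. How multiple characters are separated inside the field is not stated here, so the sketch leaves that field unsplit.
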
**The main features of the dataset are:**
- several sources but unique image identification **%website% + %id%**
* original images can be found in torrents (nyaa, rutracker)
* selective re-download of originals is possible if the source website is available
- careful deduplication with relative website priorities, high to low (mostly)
* safebooru.org
* yande.re
* gelbooru.com
* anime-pictures.net
* konachan.com
* zerochan.net
* chan.sankakucomplex.com
* danbooru.donmai.us
* e-shuushuu.net
* tbib.org
- segmentation by chronology (estimated year of release) and by aspect ratio
* "artbook pages" **7x10 (+/- 4%)**
* "wide pages" **3x4 (+/- 10%)**
* "squares" **1x1 (+/- 20%)**
* "wallpapers and computer screens" **3x2 (+/- 40%)**
* "tall pages" **2x3 (+/- 40%)** (folder name contains 1x2)
- rather high technical and visual quality of the original images
* width>=900 height>=900 MPixels>=1.2
* most comics, line art and text-heavy images excluded; no photos; almost no characterless scenes
- not completely SFW (a little bit of softcore ecchi here and there)
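
The website priority, aspect ratio and size rules from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure actually used to build the dataset: it assumes each tolerance is a relative deviation of width/height from the nominal ratio, that overlapping buckets are resolved in the listed order, and that 1 MPixel means 1,000,000 pixels. All function names are hypothetical.

```python
from typing import Optional

# Website priorities, high to low, as listed above ("mostly", per the description).
SITE_PRIORITY = [
    "safebooru.org", "yande.re", "gelbooru.com", "anime-pictures.net",
    "konachan.com", "zerochan.net", "chan.sankakucomplex.com",
    "danbooru.donmai.us", "e-shuushuu.net", "tbib.org",
]
SITE_RANK = {site: rank for rank, site in enumerate(SITE_PRIORITY)}

# (label, nominal width, nominal height, relative tolerance), in listed order.
BUCKETS = [
    ("7x10", 7, 10, 0.04),
    ("3x4", 3, 4, 0.10),
    ("1x1", 1, 1, 0.20),
    ("3x2", 3, 2, 0.40),
    ("2x3", 2, 3, 0.40),
]

def pick_preferred(duplicates: list) -> tuple:
    """From (website, id) pairs judged to be the same image, keep the top-priority site."""
    return min(duplicates, key=lambda pair: SITE_RANK[pair[0]])

def passes_quality(width: int, height: int) -> bool:
    """width>=900, height>=900, MPixels>=1.2 (assuming 1 MPixel = 1e6 pixels)."""
    return width >= 900 and height >= 900 and width * height >= 1_200_000

def aspect_bucket(width: int, height: int) -> Optional[str]:
    """Return the first bucket whose tolerance band contains width/height, else None."""
    ratio = width / height
    for label, nominal_w, nominal_h, tolerance in BUCKETS:
        nominal = nominal_w / nominal_h
        if abs(ratio - nominal) <= tolerance * nominal:
            return label
    return None

print(aspect_bucket(1920, 1080), passes_quality(1920, 1080))
print(pick_preferred([("danbooru.donmai.us", "5"), ("yande.re", "9")]))
```

Under these assumptions the bands overlap (for example, 7x10 lies inside the 3x4 band), so checking order matters. How duplicates are detected in the first place is outside this sketch.
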
The earlier version of this dataset [(2019, 512px)](https://nyaa.si/view/1206322) should be treated as obsolete.
I hope that this half-terabyte of data is worth more than a chia coin mining pool of the same size.
NOTE-1 there are also several standalone [not SFW datasets at Sukebei](https://sukebei.nyaa.si/user/AlexPUA), with sample images, metadata and some analysis done.
NOTE-2 neural network architecture [YOLO](https://github.com/ultralytics/ultralytics) seems to be very good for art. I already have [promising results](https://github.com/aperveyev/booru_yolo), stay tuned.
NOTE-3 there is a similar [BOORU CHARS 2015 dataset](https://nyaa.si/view/1468367) for "early art".
NOTE-4 next [BOORU CHARS 2022](https://nyaa.si/view/1547662) release over volumes: [2021 b-c-d](https://nyaa.si/view/1462329) and [2022 a-b](https://nyaa.si/view/1539363)