Data Preparation & JSON Generation
Before running any training or evaluation, you must unpack the raw image archives and generate the JSON index files that your YAML configurations will reference.
Directory Layout
Place the downloaded DiffusionForensics archives under a single root, for example:
DiffusionForensics/
├─ dire/
│ ├─ train/
│ │ ├─ imagenet/
│ │ │ ├─ real.zip ← “real” images
│ │ │ └─ adm.zip ← ADM-generated “fake” images
│ ├─ val/
│ │ └─ imagenet/… ← same structure for validation
│ └─ test/
│ └─ imagenet/… ← same structure for testing
└─ … (other tasks/domains)
After extracting these archives, you should have:
DiffusionForensics/dire/train/imagenet/
├─ real/
│ ├─ 000/ (40 .png)
│ ├─ 001/ (40 .png)
│ └─ …
└─ adm/
├─ 000/ (40 .png)
├─ 001/ (40 .png)
└─ …
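Before the JSON step, the zip archives need to be unpacked into the folder layout shown above. The sketch below uses Python's standard zipfile module; the archive names (real.zip, adm.zip) come from the layout above, while extract_archives is a helper name introduced here for illustration, not something shipped with the repo.

```python
import os
import zipfile


def extract_archives(base):
    """Extract real.zip and adm.zip (if present) into `base`,
    reproducing the real/... and adm/... folder layout above."""
    extracted = []
    for name in ("real", "adm"):
        archive = os.path.join(base, f"{name}.zip")
        if os.path.exists(archive):
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(base)
            extracted.append(archive)
    return extracted


# Example: extract_archives("DiffusionForensics/dire/train/imagenet")
# Repeat for the val/ and test/ subtrees.
```

Each zip is assumed to contain its top-level folder (real/ or adm/), so extracting into `base` yields the layout shown above.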
JSON Index Generation Script
Below is a minimal Python script (generate_json.py) that walks each folder, assigns labels, and dumps a train.json (or val.json / test.json):
import os
import json


def collect(root_dir, label):
    """Walk root_dir and return one record per PNG found."""
    records = []
    for subdir, _, files in os.walk(root_dir):
        for fname in files:
            if fname.lower().endswith(".png"):
                path = os.path.join(subdir, fname)
                # Convert to forward slashes for cross-platform paths
                records.append({
                    "path": path.replace("\\", "/"),
                    "label": label,
                })
    return records


if __name__ == "__main__":
    # Adjust these paths as needed
    base = "DiffusionForensics/dire/train/imagenet"
    real_dir = os.path.join(base, "real")
    adm_dir = os.path.join(base, "adm")

    # 0 = real, 1 = adm-fake
    data = collect(real_dir, 0) + collect(adm_dir, 1)

    output = os.path.join("DiffusionForensics", "dire", "train.json")
    os.makedirs(os.path.dirname(output), exist_ok=True)
    with open(output, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Wrote {len(data)} records to {output}")
Save this script at your repo root (next to statics/, training_scripts/, etc.). Running it generates DiffusionForensics/dire/train.json.
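After running generate_json.py, it is worth loading the index back and counting records per label to confirm both classes were picked up. summarize_index below is a hypothetical helper introduced for this check, not part of the repo:

```python
import json
from collections import Counter


def summarize_index(json_path):
    """Load a generated index and return (total records, per-label counts)."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data), Counter(rec["label"] for rec in data)


# Example: summarize_index("DiffusionForensics/dire/train.json")
# reports the total and how many records carry label 0 vs label 1.
```

If one of the two counts is zero, the corresponding folder was probably not extracted or the `base` path is wrong.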
To generate the validation and test splits, adjust base (and the output filename) accordingly:
# For validation
base = "DiffusionForensics/dire/val/imagenet"
# For testing
base = "DiffusionForensics/dire/test/imagenet"
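Rather than editing base once per split, the same logic can be wrapped in a loop over all three splits. This is a sketch that reuses the collect helper from generate_json.py above; build_all_splits is a name introduced here for illustration:

```python
import os
import json


def collect(root_dir, label):
    """Same helper as in generate_json.py above."""
    records = []
    for subdir, _, files in os.walk(root_dir):
        for fname in files:
            if fname.lower().endswith(".png"):
                records.append({
                    "path": os.path.join(subdir, fname).replace("\\", "/"),
                    "label": label,
                })
    return records


def build_all_splits(root="DiffusionForensics/dire"):
    """Write train.json, val.json, and test.json in one pass."""
    for split in ("train", "val", "test"):
        base = os.path.join(root, split, "imagenet")
        data = (collect(os.path.join(base, "real"), 0)
                + collect(os.path.join(base, "adm"), 1))
        out = os.path.join(root, f"{split}.json")
        os.makedirs(os.path.dirname(out), exist_ok=True)
        with open(out, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"{out}: {len(data)} records")
```

Splits whose folders are missing simply produce an empty index, so you can run this before all three subtrees are unpacked.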
The resulting JSON file is a list of objects:
[
{ "path": "DiffusionForensics/dire/train/imagenet/real/000/0001.png", "label": 0 },
{ "path": "DiffusionForensics/dire/train/imagenet/real/000/0002.png", "label": 0 },
…,
{ "path": "DiffusionForensics/dire/train/imagenet/adm/999/039.png", "label": 1 },
…
]
path: forward-slash style, relative to your project root or absolute.
label: integer class (0 = real, 1 = ADM-fake).
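Before training, a small validation pass over the index can catch missing files or out-of-range labels early. validate_index is a hypothetical helper introduced here, not part of the repo:

```python
import json
import os


def validate_index(json_path):
    """Check every record points at an existing file with a valid label.
    Returns a list of problem descriptions (empty means the index is clean)."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    problems = []
    for i, rec in enumerate(records):
        if rec.get("label") not in (0, 1):
            problems.append(f"record {i}: bad label {rec.get('label')!r}")
        if not os.path.isfile(rec.get("path", "")):
            problems.append(f"record {i}: missing file {rec.get('path')!r}")
    return problems
```

Note that relative paths in the JSON are resolved against the current working directory, so run this from the same directory you will launch training from.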
Integrate with YAML
In your statics/aigc/resnet_train.yaml
, point the dataset paths at these JSON files:
train_dataset:
  name: AIGCLabelDataset
  init_config:
    image_size: 224
    path: DiffusionForensics/dire/train.json

test_dataset:
  - name: AIGCLabelDataset
    init_config:
      image_size: 224
      path: DiffusionForensics/dire/test.json
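To confirm the YAML actually points at the generated files, you can parse it and pull out the dataset paths. This sketch assumes PyYAML is installed and that the config follows the structure shown above; dataset_json_paths is a name introduced here, not a repo utility:

```python
import yaml  # PyYAML; assumed available in the training environment


def dataset_json_paths(yaml_path):
    """Extract the JSON index paths referenced by the train/test dataset
    entries of a config shaped like statics/aigc/resnet_train.yaml."""
    with open(yaml_path, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    paths = [cfg["train_dataset"]["init_config"]["path"]]
    for entry in cfg["test_dataset"]:  # test_dataset is a list in the YAML
        paths.append(entry["init_config"]["path"])
    return paths


# Example: for p in dataset_json_paths("statics/aigc/resnet_train.yaml"):
#              assert os.path.isfile(p), f"missing index: {p}"
```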
That completes the data preparation step. Your training scripts can now load images and labels directly from these JSON indexes.