Running Training & Evaluation

With your YAML configs and JSON indexes in place, you can now launch model training and the subsequent evaluation. Below we cover single-GPU and multi-GPU training, monitoring progress, resuming from checkpoints, and running in “test” mode.

Prerequisites

Project root
Make sure you’re in the directory that contains statics/, training_scripts/, common/, etc.

Environment & dependencies
Activate your virtualenv or Conda environment, then install the requirements if you haven’t already. Ensure you have a CUDA-enabled PyTorch (with torchrun) if you plan to use GPUs.

YAML & JSON
Your statics/aigc/resnet_train.yaml should point to DiffusionForensics/dire/train.json. If you have a separate val.json or test.json, your YAML should include a test_dataset: section.
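The checks above can be scripted before a long run. The following is a minimal sketch, not part of the project: the helper name missing_prerequisites is hypothetical, and it only verifies the directories named above, that the torch package is importable, and that torchrun is on PATH.

```python
import importlib.util
import shutil
from pathlib import Path

# Directory names come from the text above; the helper itself is hypothetical.
REQUIRED_DIRS = ("statics", "training_scripts", "common")


def missing_prerequisites(root=".", need_gpu_launcher=True):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    root = Path(root)
    problems = []
    for name in REQUIRED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}/")
    # find_spec only checks that the package can be located; it does not import it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed in this environment")
    # torchrun is installed alongside PyTorch; it is only needed for multi-GPU runs.
    if need_gpu_launcher and shutil.which("torchrun") is None:
        problems.append("torchrun not found on PATH")
    return problems
```

Calling missing_prerequisites() from the project root returns an empty list when everything is in place. It does not check that the installed PyTorch build is CUDA-enabled.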

Single-GPU Training

For a quick sanity check on one GPU:

CUDA_VISIBLE_DEVICES=0 \
python training_scripts/train.py \
  --config statics/aigc/resnet_train.yaml

This will:
Load your train.json via your designated dataset
Build Resnet50(pretrained=True, image_size=224)
Run through epochs as specified in your YAML
Write logs & checkpoints to log_dir

Multi-GPU (DDP) Training

To leverage multiple GPUs, use torchrun or the wrapper script.

Using torchrun:

CUDA_VISIBLE_DEVICES=0,1 \
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=2 \
  training_scripts/train.py \
  --config statics/aigc/resnet_train.yaml

--nproc_per_node should match the number of GPUs listed in gpus: in your YAML, and it should not exceed the number of devices exposed through CUDA_VISIBLE_DEVICES.

Using the helper script:

bash statics/run.sh statics/aigc/resnet_train.yaml

Alternatively, to override the default config path in run.sh, set yaml_config inline:

yaml_config="statics/aigc/resnet_train.yaml" bash statics/run.sh
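To keep the process count and the visible devices in sync, the launch command can be derived from a single list of GPU ids. This is a rough sketch under stated assumptions: build_launch is a hypothetical helper, and it simply reproduces the single-GPU and torchrun invocations shown above.

```python
import os
import sys


def build_launch(gpu_ids, config="statics/aigc/resnet_train.yaml"):
    """Return (argv, env) for a single- or multi-GPU run of training_scripts/train.py."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    # Expose exactly the requested devices to the child process.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    script = ["training_scripts/train.py", "--config", config]
    if len(gpu_ids) == 1:
        # One GPU: plain Python launch, as in the single-GPU section.
        argv = [sys.executable, *script]
    else:
        # Several GPUs: one torchrun worker per listed device.
        argv = ["torchrun", "--standalone", "--nnodes=1",
                f"--nproc_per_node={len(gpu_ids)}", *script]
    return argv, env
```

Passing the result to subprocess.run(argv, env=env, check=True) would start the run. The gpus: entry in the YAML still has to be kept consistent by hand, since this sketch does not read the config.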

Monitoring Progress

In your log_dir (from YAML), you’ll find:
logs.log → stdout (loss, accuracy per epoch)
error.log → stderr (stack traces, warnings)

Tail live output:

tail -f log/aigc_resnet_df_train/logs.log

TensorBoard (if configured):

tensorboard --logdir log/aigc_resnet_df_train
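For unattended runs, a small script can report the latest output and flag crashes. The sketch below is illustrative only: summarize_logs is a hypothetical helper, it assumes the logs.log and error.log file names described above, and it makes no assumption about the format of the loss and accuracy lines. It counts Python traceback headers, which always begin with "Traceback (most recent call last):".

```python
from pathlib import Path


def summarize_logs(log_dir, tail=5):
    """Return the last `tail` lines of logs.log and the number of tracebacks in error.log."""
    log_dir = Path(log_dir)
    out_path = log_dir / "logs.log"
    err_path = log_dir / "error.log"

    last_lines = []
    if out_path.is_file() and tail > 0:
        last_lines = out_path.read_text(errors="replace").splitlines()[-tail:]

    tracebacks = 0
    if err_path.is_file():
        for line in err_path.read_text(errors="replace").splitlines():
            # Header printed by Python at the start of every exception traceback.
            if line.startswith("Traceback (most recent call last):"):
                tracebacks += 1
    return last_lines, tracebacks
```

For example, summarize_logs("log/aigc_resnet_df_train") returns the five most recent stdout lines together with a traceback count. A non-zero count suggests inspecting error.log directly.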

Resuming from Checkpoint

To resume interrupted training, set the following in your YAML:

resume: "path/to/checkpoint.pth"
start_epoch: 5

Then re-run the same launch command. Training will pick up from epoch 5, the value given in start_epoch.
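A mistyped checkpoint path or an out-of-range start_epoch is cheap to catch before relaunching. The following sketch checks the two resume keys on an already-parsed config dictionary. The helper remaining_epochs and the total_epochs argument are hypothetical, and zero-based epoch numbering is an assumption, not something the training script is known to use.

```python
from pathlib import Path


def remaining_epochs(cfg, total_epochs):
    """Validate the resume settings in `cfg` and return the epochs still to run."""
    checkpoint = cfg.get("resume")
    start_epoch = cfg.get("start_epoch", 0)

    # bool is a subclass of int, so reject it explicitly.
    if isinstance(start_epoch, bool) or not isinstance(start_epoch, int) or start_epoch < 0:
        raise ValueError(f"start_epoch must be a non-negative integer, got {start_epoch!r}")
    if checkpoint:
        if not Path(checkpoint).is_file():
            raise FileNotFoundError(f"checkpoint not found: {checkpoint}")
    elif start_epoch != 0:
        # A policy choice of this sketch: a non-zero start needs weights to resume from.
        raise ValueError("start_epoch is set but resume is empty")
    if start_epoch >= total_epochs:
        raise ValueError("start_epoch is not below the total number of epochs")

    # Assumes zero-based epochs: start_epoch=5 of 8 leaves epochs 5, 6, 7.
    return range(start_epoch, total_epochs)
```

With resume pointing at an existing file, start_epoch: 5, and eight epochs in total, the function returns range(5, 8).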

Running in Test Mode

Prepare your test YAML: create statics/aigc/resnet_test.yaml (for instance, from your train config) and set the test-mode fields, e.g.:

flag: test
test_dataset:
  - name: AIGCLabelDataset
    init_config:
      image_size: 224
      path: DiffusionForensics/dire/test.json

Launch evaluation:

bash statics/run.sh statics/aigc/resnet_test.yaml
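Before launching, the test config can be sanity-checked for the fields shown above. This is a minimal sketch operating on an already-parsed dictionary (for example, the result of yaml.safe_load from PyYAML, if it is installed). The helper check_test_config is hypothetical and only looks at flag, test_dataset, name, and init_config.path. Other keys the project may require are not covered.

```python
def check_test_config(cfg):
    """Return a list of problems found in a parsed test-mode config (empty if none)."""
    problems = []
    if cfg.get("flag") != "test":
        problems.append('flag should be "test"')

    datasets = cfg.get("test_dataset")
    if not isinstance(datasets, list) or not datasets:
        problems.append("test_dataset should be a non-empty list")
        return problems

    for i, entry in enumerate(datasets):
        if not isinstance(entry, dict):
            problems.append(f"test_dataset[{i}] should be a mapping")
            continue
        if not entry.get("name"):
            problems.append(f"test_dataset[{i}] has no name")
        init = entry.get("init_config")
        if not isinstance(init, dict) or not init.get("path"):
            problems.append(f"test_dataset[{i}] has no init_config.path")
    return problems
```

An empty list means the fields checked here are present. It does not confirm that the JSON index at init_config.path exists or is readable.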