Evaluate Metrics
To start the evaluation on a given list of metrics, you need to instantiate a dataset inheriting from our DiagnosticDataset.
The calculation starts by calling calculate_metrics, passing the model adapter, dataset, output directory, and number of trials as parameters.
The parameter trials refers to the number of Monte Carlo trials that are performed and averaged for the respective metrics.
The following code block contains an example of what such a script could look like.
from vqa_benchmarking_backend.datasets.GQADataset import GQADataset # or import your own dataset
from vqa_benchmarking_backend.metrics.metrics import calculate_metrics
output_dir = '/path/to/my/output/dir' # set output directory for results. This should match the directory you are supplying to the webserver in webapp/server.py
# directories containing the data
qsts_path = 'path/to/GQA/questions.json'
img_dir = 'path/to/GQA/images/'
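# 'load_idx_mapping' below is a placeholder you have to provide yourself.
# A minimal sketch, assuming the mapping is stored as a JSON file with string
# keys (the file name here is hypothetical), could look like this:
import json
def load_idx_mapping(path: str = 'path/to/GQA/idx2ans.json'):
    with open(path) as f:
        # convert string keys back to integer answer indices
        return {int(idx): ans for idx, ans in json.load(f).items()}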
# file that contains a dictionary mapping from answer index to answer text: {idx: ans_str}
idx2ans = load_idx_mapping()
# instantiate dataset using data directories and index/answer mapping
dataset = GQADataset(question_file=qsts_path, img_dir=img_dir, img_feat_dir='', idx2ans=idx2ans, name='GQA')
# define a list of all metrics the model should be tested on; remove entries as needed
metrics = [
'accuracy',
'question_bias_imagespace',
'image_bias_wordspace',
'image_robustness_imagespace',
'image_robustness_featurespace',
'question_robustness_featurespace',
'sears',
'uncertainty'
]
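# 'model_adapter' is assumed to be an instance of your model adapter wrapping
# the VQA model under test; creating it is not shown in this snippet.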
# Run the metrics calculation. Once finished, start the webserver at webapp/server.py and the Vue.js app using 'npm start' in the webapp/ folder, then inspect the results in your web browser.
calculate_metrics(adapter=model_adapter, dataset=dataset, output_path=output_dir, metrics=metrics, trials=7)