Integrate new Datasets
This document provides a brief overview of how to integrate a new benchmarking dataset.
We provide two base classes that the new dataset needs to inherit from:
DataSample
DiagnosticDataset
Create new Data Samples
Each sample of a dataset is represented as an object of type vqa_benchmarking_backend.datasets.dataset.DataSample.
It stores all the relevant information, such as the IDs of the question and image, the tokenized question, the corresponding answers, and the path to the image.
The following code block contains an exemplary DataSample subclass:
from typing import Dict, List

import numpy as np

from vqa_benchmarking_backend.datasets.dataset import DataSample

class MyDataSample(DataSample):
    def __init__(self,
                 question_id: str,
                 question: str,
                 answers: Dict[str, float],
                 image_id: str,
                 image_path: str) -> None:
        super().__init__(question_id,
                         question,
                         answers,
                         image_id,
                         image_path)
        # add your question preprocessing function here
        self._question = preprocess_question(question)

    @property
    def image(self) -> np.ndarray:
        # lazily load the image from disk on first access,
        # using your own image loading function
        if self._img is None:
            self._img = load_img(self._image_path)
        return self._img

    @image.setter
    def image(self, image: np.ndarray):
        self._img = image
        # reset cached image features, since the image was updated
        self._img_feats = None

    @property
    def question_tokenized(self) -> List[str]:
        return self._question.split()
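As a quick illustration, and assuming preprocess_question and load_img are your own helper functions as used above, a sample could be constructed and inspected like this (the IDs, question, answer, and path are made up):

sample = MyDataSample(question_id='q_0',
                      question='What color is the car?',
                      answers={'red': 1.0},
                      image_id='img_0',
                      image_path='images/img_0.jpg')
print(sample.question_tokenized)  # tokens of the preprocessed question
img = sample.image                # lazily loads the image from image_path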
Create new Diagnostic Datasets
An object of DiagnosticDataset requires the path to the image directory, a name for the dataset, and a dictionary that maps each classifier output index to the corresponding natural language answer string.
The __getitem__ accessor should return an instance of your custom vqa_benchmarking_backend.datasets.dataset.DataSample class (the MyDataSample class created above).
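For illustration, such a mapping could look as follows (the concrete answer strings depend on your classifier head and are made up here):

# hypothetical mapping from classifier output index to answer string
idx2ans = {
    0: 'yes',
    1: 'no',
    2: 'red',
}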
Here, the constructor loads all data using the _load_data method. You should write your own data loading function to match the data format of your dataset.
The data property should be a list containing one MyDataSample object for each entry in the original data format.
The following code block contains an exemplary vqa_benchmarking_backend.datasets.dataset.DiagnosticDataset subclass:
import json
import os
from typing import Dict, List, Tuple, Union

from tqdm import tqdm

from vqa_benchmarking_backend.datasets.dataset import DataSample, DiagnosticDataset
from vqa_benchmarking_backend.utils.vocab import Vocabulary

class MyDataset(DiagnosticDataset):
    def __init__(self,
                 question_file: str,
                 img_dir: str,
                 idx2ans: Dict[int, str],
                 name: str) -> None:
        self.img_dir = img_dir
        self.idx2ans = idx2ans
        self.name = name
        self.data, self.qid_to_sample, self.q_vocab, self.a_vocab = self._load_data(question_file)

    def _load_data(self, question_file: str) -> Tuple[List[DataSample], Dict[str, DataSample], Vocabulary, Vocabulary]:
        data = []
        qid_to_sample = {}
        answer_vocab = Vocabulary(itos={}, stoi={})
        question_vocab = Vocabulary(itos={}, stoi={})
        # load questions
        ques = json.load(open(question_file))
        for qid in tqdm(ques):
            iid = str(ques[qid]['imageId'])
            # MyDataSample is the sample class defined above
            sample = MyDataSample(question_id=qid,
                                  question=ques[qid]['question'],
                                  answers={ques[qid]['answer']: 1.0},
                                  image_id=iid,
                                  image_path=os.path.join(self.img_dir, f"{iid}.jpg"))
            answer_vocab.add_token(ques[qid]['answer'])
            for token in sample.question_tokenized:
                question_vocab.add_token(token)
            qid_to_sample[qid] = sample
            data.append(qid_to_sample[qid])
        return data, qid_to_sample, question_vocab, answer_vocab

    def __getitem__(self, index) -> DataSample:
        return self.data[index]

    def label_from_class(self, class_index: int) -> str:
        return self.a_vocab.itos(class_index)

    def word_in_vocab(self, word: str) -> bool:
        return self.q_vocab.exists(word)

    def __len__(self):
        return len(self.data)

    def get_name(self) -> str:
        # needed for file caching
        return self.name

    def index_to_question_id(self, index) -> str:
        return self.data[index].question_id

    def class_idx_to_answer(self, class_idx: int) -> Union[str, None]:
        # idx2ans may be keyed by int or str, depending on how it was created
        if isinstance(next(iter(self.idx2ans.keys())), int):
            if class_idx in self.idx2ans:
                return self.idx2ans[class_idx]
        else:
            if str(class_idx) in self.idx2ans:
                return self.idx2ans[str(class_idx)]
        return None
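To tie things together, the dataset could then be instantiated and queried as sketched below. The file name, image directory, and idx2ans mapping are placeholders, and the JSON layout shown in the comment only reflects what the exemplary _load_data above expects:

# hypothetical layout of the question file read by _load_data above:
# {
#   "q_0": {"imageId": "1234", "question": "What color is the car?", "answer": "red"},
#   "q_1": {"imageId": "5678", "question": "Is there a dog?", "answer": "yes"}
# }

dataset = MyDataset(question_file='questions.json',
                    img_dir='images/',
                    idx2ans=idx2ans,
                    name='my_dataset')

print(len(dataset))                    # number of samples
sample = dataset[0]                    # returns a MyDataSample
print(dataset.class_idx_to_answer(0))  # maps a classifier index back to an answer string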