Integrate new Datasets
=======================

This document gives a brief overview of how to integrate a new benchmarking dataset.

The package provides two base classes that a new dataset needs to subclass:

- ``DataSample``
- ``DiagnosticDataset``

Create new Data Samples
-------------------------

Each sample of a dataset is represented as an object of type ``vqa_benchmarking_backend.datasets.dataset.DataSample``.
It stores all relevant information, such as the question and image IDs, the tokenized question, the corresponding answers,
and the path to the image.

The following code block shows an example ``DataSample`` subclass:

.. code-block:: python

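    from typing import Dict, List

    import numpy as np

    from vqa_benchmarking_backend.datasets.dataset import DataSample

    # preprocess_question and load_img are placeholders for your own helpers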

    class MyDataSample(DataSample):
        def __init__(self, 
                     question_id: str, 
                     question: str, 
                     answers: Dict[str, float], 
                     image_id: str, 
                     image_path: str) -> None:

            super().__init__(question_id, 
                             question, 
                             answers, 
                             image_id, 
                             image_path)
            # add your question preprocessing function
            self._question = preprocess_question(question)

        @property
        def image(self) -> np.ndarray:
            if self._img is None:
                self._img = load_img(self._image_path)
            return self._img
        
        @image.setter
        def image(self, image: np.ndarray):
            self._img = image
            # reset image features, since image updated
            self._img_feats = None

        @property
        def question_tokenized(self) -> List[str]:
            return self._question.split()
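
The example above calls ``preprocess_question`` and ``load_img`` without defining them. Both names are placeholders for your own code, and ``load_img`` is expected to return an ``np.ndarray``, as the ``image`` property annotation indicates. A minimal sketch of a question preprocessor, assuming that lowercasing and replacing punctuation with spaces suits your model, might look like this:

.. code-block:: python

    import re

    def preprocess_question(question: str) -> str:
        """Hypothetical helper: lowercase, drop punctuation, collapse whitespace."""
        question = question.lower()
        # replace every character that is neither a word character nor whitespace
        question = re.sub(r"[^\w\s]", " ", question)
        # collapse runs of whitespace so that split() yields clean tokens
        return " ".join(question.split())

With this helper, ``question_tokenized`` returns lowercase tokens without punctuation, which keeps the question vocabulary small.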

Create new Diagnostic Datasets
--------------------------------

A ``DiagnosticDataset`` requires the path to the image directory, a name for the dataset, and a dictionary
that maps each classifier index to its natural-language answer string.
The ``__getitem__`` accessor should return an instance of your custom ``vqa_benchmarking_backend.datasets.dataset.DataSample`` subclass (``MyDataSample`` above).
In the example below, the constructor loads all data using the ``_load_data`` method. Write your own loading function to match the data format of your dataset.
The ``data`` attribute should be a list containing one ``MyDataSample`` object per entry in the original data.
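
If the index-to-answer mapping is stored as JSON, keep in mind that JSON object keys are always strings, so integer indices come back as ``"0"``, ``"1"``, and so on after loading. The short sketch below uses a made-up three-answer mapping (not taken from any real dataset) to show the effect and one way to convert the keys back:

.. code-block:: python

    import json
    from typing import Dict

    # made-up mapping, for illustration only
    idx2ans: Dict[int, str] = {0: "yes", 1: "no", 2: "red"}

    # a JSON round trip turns the integer keys into strings
    loaded = json.loads(json.dumps(idx2ans))
    print(sorted(loaded.keys()))  # ['0', '1', '2']

    # convert the keys back to int before passing the mapping to the dataset
    restored = {int(idx): answer for idx, answer in loaded.items()}

The ``class_idx_to_answer`` method in the example below accepts either key type, so this conversion is optional there.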

The following code block shows an example ``vqa_benchmarking_backend.datasets.dataset.DiagnosticDataset`` subclass:

.. code-block:: python

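    import json
    import os
    from typing import Dict, List, Tuple, Union

    from tqdm import tqdm

    from vqa_benchmarking_backend.datasets.dataset import DataSample, DiagnosticDataset
    from vqa_benchmarking_backend.utils.vocab import Vocabulary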

    class MyDataset(DiagnosticDataset):
        def __init__(self, 
                     question_file: str, 
                     img_dir: str,
                     idx2ans: Dict[int, str],
                     name: str) -> None:

            self.img_dir      = img_dir
            self.idx2ans      = idx2ans
            self.name         = name
    
            self.data, self.qid_to_sample, self.q_vocab, self.a_vocab = self._load_data(question_file)
    
        def _load_data(self, question_file: str) -> Tuple[List[DataSample], Dict[str, DataSample], Vocabulary, Vocabulary]:
            data = []
            qid_to_sample = {}
            answer_vocab = Vocabulary(itos={}, stoi={})
            question_vocab = Vocabulary(itos={}, stoi={})
            # load questions (the context manager closes the file after reading)
            with open(question_file) as f:
                ques = json.load(f)
            for qid in tqdm(ques):
                iid = str(ques[qid]['imageId'])
                sample = MyDataSample(question_id=qid,
                                      question=ques[qid]['question'], 
                                      answers={ques[qid]['answer']: 1.0},
                                      image_id=iid,
                                      image_path=os.path.join(self.img_dir, f"{iid}.jpg"))
                answer_vocab.add_token(ques[qid]['answer'])
                for token in sample.question_tokenized:
                    question_vocab.add_token(token)
                qid_to_sample[qid] = sample
                data.append(sample)
            
            return data, qid_to_sample, question_vocab, answer_vocab
        
        def __getitem__(self, index) -> DataSample:
            return self.data[index]
    
        def label_from_class(self, class_index: int) -> str:
            return self.a_vocab.itos(class_index)
        
        def word_in_vocab(self, word: str) -> bool:
            return self.q_vocab.exists(word)
    
        def __len__(self):
            return len(self.data)
        
        def get_name(self) -> str:
            # Needed for file caching
            return self.name
    
        def index_to_question_id(self, index) -> str:
            return self.data[index].question_id
        
        def class_idx_to_answer(self, class_idx: int) -> Union[str, None]:
            if isinstance(next(iter(self.idx2ans.keys())), int):
                if class_idx in self.idx2ans:
                    return self.idx2ans[class_idx]
            else:
                if str(class_idx) in self.idx2ans:
                    return self.idx2ans[str(class_idx)]
            return None
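
The ``_load_data`` method above expects ``question_file`` to be a JSON object keyed by question ID, where each entry has the fields ``imageId``, ``question``, and ``answer``. The following sketch writes a tiny file in that layout and reads it back. The two entries and the temporary file are invented for illustration, and the package itself is not needed to run it:

.. code-block:: python

    import json
    import os
    import tempfile

    # invented entries that follow the layout read by _load_data
    questions = {
        "q1": {"imageId": 101, "question": "What color is the car?", "answer": "red"},
        "q2": {"imageId": 102, "question": "Is it raining?", "answer": "no"},
    }

    with tempfile.TemporaryDirectory() as tmp_dir:
        question_file = os.path.join(tmp_dir, "questions.json")
        with open(question_file, "w") as f:
            json.dump(questions, f)

        with open(question_file) as f:
            ques = json.load(f)

    # same per-entry access pattern as in _load_data
    rows = [
        (qid, str(entry["imageId"]), entry["question"], entry["answer"])
        for qid, entry in ques.items()
    ]

If your dataset uses different field names or a list instead of an object, adapt the access pattern in your own ``_load_data`` accordingly.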