A core region captioning framework for automatic video understanding in story video contents
Due to the rapid increase in images and image data, research on the visual analysis of such unstructured data has recently been actively conducted. One of the representative image captioning models, the DenseCap model, extracts various regions in an image and generates region-level captions. However, since the existing DenseCap model does not