Confluence Of Vision And Natural Language Processing For Cross-Media Semantic Relations Extraction