NLP_ThumbnailAnnotator

M.Sc. Project in Language Technology at Uni Hamburg by Florian Schneider


Project maintained by uhh-lt Hosted on GitHub Pages — Theme by mattgraham

Developer Guide

This guide describes the basic architecture and the most important components of the REST API so that you can extend or build on it.

Software Architecture of the API

The software is written in Java 8 using several Java frameworks, namely UIMA (wrapped by DKPro Core), DKPro WSD and Spring-Boot, and is structured as a multi-module Maven project. The parent module's pom.xml holds the basic configuration, properties, plugins and dependencies that are required in all three modules and is located in the thumbnailAnnotator.parent directory. This directory, which is the root directory of the whole project, holds the three main modules and should be imported as a Maven project into the IDE of your choice. A diagram of how the modules interact is shown below.

Core Module

This module holds the business logic as well as the domain model. The module is further divided into multiple packages:

Domain Package

This package holds the domain model. All of the POJO classes in this package inherit from the DomainObject class, to indicate that they are part of the domain model. The main components of the domain model are the CaptionToken, the Thumbnail, the ExtractorResult and the CrawlerResult.

CaptionTokenExtractor Package

CaptionTokenExtractor

This package contains the CaptionTokenExtractor - the main component to extract CaptionTokens from a UserInput. It is designed as a Singleton class and its core functionality is implemented using the UIMA Framework wrapped by the DKPro Core Framework and DKPro WSD. The CaptionTokenExtractor contains a managed ExecutorService since the extraction happens in parallel. For each UserInput an ExtractorAgent gets instantiated, which extracts the CaptionTokens.
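The Singleton-plus-agent structure described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class names SketchExtractor and SketchExtractorAgent, the pool size of 4 and the whitespace splitting are all assumptions made for the example. The real ExtractorAgent runs the UIMA/DKPro pipeline instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the real agent: it only splits on whitespace,
// whereas the real ExtractorAgent runs the UIMA/DKPro pipeline.
class SketchExtractorAgent implements Callable<List<String>> {
    private final String userInput;

    SketchExtractorAgent(String userInput) {
        this.userInput = userInput;
    }

    @Override
    public List<String> call() {
        return Arrays.asList(userInput.trim().split("\\s+"));
    }
}

// Singleton that owns the thread pool and hands each input to its own agent.
final class SketchExtractor {
    private static final SketchExtractor INSTANCE = new SketchExtractor();

    // Pool size 4 is an arbitrary choice; daemon threads let the JVM exit
    // even if the pool is never shut down explicitly.
    private final ExecutorService pool = Executors.newFixedThreadPool(4, runnable -> {
        Thread worker = new Thread(runnable, "sketch-extractor");
        worker.setDaemon(true);
        return worker;
    });

    private SketchExtractor() { }

    static SketchExtractor getInstance() {
        return INSTANCE;
    }

    // Submits one agent per input and returns immediately with a Future.
    Future<List<String>> startExtraction(String userInput) {
        return pool.submit(new SketchExtractorAgent(userInput));
    }

    // Convenience wrapper that waits for the result and rethrows failures unchecked.
    List<String> extractBlocking(String userInput) {
        try {
            return startExtraction(userInput).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for extraction", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        }
    }
}
```

Returning a Future from startExtraction lets callers submit several inputs first and collect the results afterwards, which is what makes the parallel extraction worthwhile.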

To extract CaptionTokens, an aggregated AnalysisEngine is used to create CaptionTokenAnnotations, which are then transformed to CaptionTokens. This aggregated AnalysisEngine consists of the following Annotators, where Annotators 6 to 9 are custom Annotators that are created by the CaptionTokenExtractor.

  1. OpenNlpSegmenter
  2. ClearNlpPosTagger
  3. ClearNlpLemmatizer
  4. MaltParser
  5. OpenNlpNamedEntityRecognizer with variants
    • location
    • person
    • organization
  6. PosExclusionFlagTokenAnnotator
  7. NamedEntityCaptionTokenAnnotator
  8. PosViewCreator
  9. NounCaptionTokenAnnotator

Custom Annotators

The custom Annotators can be found in the nlp.floschne.thumbnailAnnotator.core.captionTokenExtractor.annotator package.

Custom Annotations

These Annotations are described by XML files located at src/main/resources/desc.type, and their classes are generated by the JCasGen Maven Plugin during the generation phase of Maven.

ThumbnailCrawler Package

ThumbnailCrawler

This package contains the ThumbnailCrawler - a Singleton class that searches for the Thumbnails of a given CaptionToken. This is done in a straightforward way by querying an IThumbnailsource with a CaptionToken. In the current version there is one implementation of the interface - the ShutterstockSource - but additional implementations can be added easily. In order to receive a specified number of image URLs, the value of the CaptionToken is used as the query parameter of an HTTP GET request to the Shutterstock REST API. If fewer results than specified are returned, only the head (last) Token of the CaptionToken is used as the query, since it holds the most general description. Then, for each of those URLs a Thumbnail gets instantiated and its priority is initialized with ‘1’.
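The query-with-fallback behaviour described above could look roughly like the sketch below. All names here (SketchThumbnailSource, SketchThumbnail, SketchCrawler, queryThumbnailUrls) are hypothetical, and the real IThumbnailsource interface may have a different shape. The sketch also assumes that the fallback results replace the first result set, which the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the source interface; the real IThumbnailsource may differ.
interface SketchThumbnailSource {
    List<String> queryThumbnailUrls(String query, int limit);
}

// Minimal stand-in for the domain class: a URL plus a priority.
class SketchThumbnail {
    final String url;
    final int priority;

    SketchThumbnail(String url, int priority) {
        this.url = url;
        this.priority = priority;
    }
}

final class SketchCrawler {
    private SketchCrawler() { }

    // 'tokens' are the words of one CaptionToken value, e.g. ["big", "red", "car"].
    static List<SketchThumbnail> crawl(SketchThumbnailSource source, List<String> tokens, int limit) {
        // First attempt: query with the full CaptionToken value.
        List<String> urls = source.queryThumbnailUrls(String.join(" ", tokens), limit);

        if (urls.size() < limit && !tokens.isEmpty()) {
            // Too few hits: retry with only the head (last) token, the most general term.
            // Assumption: the fallback result replaces the first result.
            urls = source.queryThumbnailUrls(tokens.get(tokens.size() - 1), limit);
        }

        // Never return more than requested, and start every Thumbnail at priority 1.
        List<SketchThumbnail> thumbnails = new ArrayList<>();
        for (String url : urls.subList(0, Math.min(limit, urls.size()))) {
            thumbnails.add(new SketchThumbnail(url, 1));
        }
        return thumbnails;
    }
}
```

Because the source is an interface with a single method, a test double can be supplied as a lambda, and a further image provider would only need to implement that one method.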

The ThumbnailCrawler contains a managed ExecutorService since getting Thumbnails is done in parallel. For each CaptionToken a CrawlerAgent gets instantiated which performs the crawling for that CaptionToken.

DB Module

As the name suggests, this module contains the database layer. The database used is Redis, a well-known in-memory NoSQL database. The module uses it as a simple key-value store to store the entities by an Id of type String. To keep things simple and accessible in the API module, the module heavily depends on the Spring-Boot Framework, especially the Spring-Boot-Data-Redis component. This module has high code coverage since it is crucial for the overall functionality of the API.

Redis Configuration

The connection to Redis is configured in the RedisConfig. There are two Spring profiles that specify the connection to the database, one for running the API locally and one for running it inside docker-compose.

Subpackages

The module has four subpackages:

API Module

This module is responsible for exposing the REST API resources used to interact with the Thumbnail Annotator. It connects the domain model functionality of the Core Module with the DB Module. Since this module should hold as little logic as possible, there are only three classes. The module also heavily depends on the Spring-Boot Framework, especially the Spring-Boot-Starter-Web component.

Components

REST API Documentation

The RESTful API is documented by a Swagger-UI implemented with Springfox, which can be found by opening localhost:8081 in a browser (assuming the API is running). A screenshot of the header as well as the methods part of the Swagger-UI can be seen below.