Semantic Enrichment Component


Semantic Enrichment Component (SEC) is part of the 4A system and connects several of its parts. It provides semantic text enrichment services, for example text annotation (service annotate) and listing of all types and attributes in the KB (service get_entity_types_and_attributes).

The service is publicly available at http://sec.fit.vutbr.cz/ on port 8082 (Protocol Documentation).

1 Prerequisites

The current version is available on git in branch D114-SEC_API:

 git clone http://sec.fit.vutbr.cz/sec/secapi.git secapi && cd secapi
 git checkout -b D114-SEC_API origin/D114-SEC_API
        

This creates the secapi directory and changes into it. Next, download the KB in the ./NER directory and then build the necessary programs in ./SEC_API using the make command.

 (cd ./NER && ./deleteKB.sh && ./downloadKB.sh)
 (cd ./SEC_API && make)
        

Note that when downloadKB.sh is run, the KB and the automata (*.fsa) must not already be present in the directories secapi/NER and secapi/NER/figa. It is advised to delete them beforehand using the deleteKB.sh script. Beware: the scripts downloadKB.sh and deleteKB.sh must be launched only from the directory in which they are located (that is, ./NER)!


2 Description of parts

SEC is located in the directory ./SEC_API (this is the working path; relative paths below are given from there). Its scripts are written so that they can be called from any directory. At the moment SEC is divided into the scripts sec_daemon.py, sec.py and sec_api.py.

2.1 Script sec_daemon.py

Script sec_daemon.py is the core of SEC. It was created to reduce memory demands when running sec.py in parallel. Launching this script creates a Unix domain socket (UDS) on which the daemon waits for connections from several instances of sec.py or sec_api.py. Instances of these two scripts communicate with sec_daemon.py using the internal communication protocol described below.

2.1.1 Usage

 ./sec_daemon.py [-h] [-p PATH] [--own_kb_daemon]
 Optional arguments:
   -h, --help            shows help and then terminates.
   -p PATH, --uds_path PATH
                         Sets path to Unix domain socket, where daemon is 
                         waiting for clients. Default value is ./daemon_uds 
                         relative to script's directory.
   --own_kb_daemon       Launches its own KB daemon even if any other is already running.
        

2.2 Script sec.py

Script sec.py is a client of the daemon sec_daemon.py. It presents the SEC services to the user. A request in JSON is expected on standard input; the answer is written to standard output. A description of the services and requests, with examples, can be found in ./doc/sec_api.pdf after building with the make command.

2.2.1 Usage

 ./sec.py [-h] [-t [DIRECTORY]] [-p PATH] [-c CONFIG.json] [--plaintext]
          [-f FILENAME]
        
 Optional arguments:
   -h, --help            shows help and then terminates.
   -t [DIRECTORY], --testing_mode [DIRECTORY]
                         Switches to test mode, which allows testing work
                         with structured annotations that NER does not know.
                         In this mode, service "annotate" looks in the query
                         part of URI "DOCUMENT_URI" for the value of key "tid"
                         and, according to it, looks in directory DIRECTORY
                         for a file with the answer for "annotation_format".
                         If such a file is found, its content is returned
                         instead of the results from NER. The query of URI
                         "DOCUMENT_URI" can also contain key "aid". Unlike
                         with key "tid", the content of the file found for
                         this value only enriches the result of NER (it is
                         appended to it). Default value is ./testing_mode
                         relative to script's directory.
   -p PATH, --uds_path PATH
                         Sets path to Unix domain socket, where daemon is 
                         waiting for clients. Default value is ./daemon_uds 
                         relative to script's directory.
   -c CONFIG.json, --config_file CONFIG.json
                         Sets the service and its parameters from JSON file, 
                         instead of standard input. In this case just a text to 
                         be processed or nothing is expected on standard input.
   --plaintext           Output of services "annotate", "annotate_vertical"
                         and "get_raw_annotations" is a plain text. If an exception 
                         occurs, it remains in JSON.
   -f FILENAME, --filename FILENAME
                         Sets filename for service "annotate_vertical".
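
The stdin/stdout interface described above can be driven from another program. The following is a minimal sketch, assuming sec_daemon.py is already running; the helper names build_request and run_sec are illustrative and not part of SEC.

```python
import json
import subprocess


def build_request(service, params):
    """Serialize a single-service SEC request as JSON text."""
    return json.dumps({service: params})


def run_sec(request_json, sec_path="./sec.py", uds_path=None):
    """Run sec.py once, feeding the request on stdin (hypothetical helper).

    Requires a running sec_daemon.py; returns the parsed JSON answer.
    """
    cmd = [sec_path]
    if uds_path is not None:
        # Must match the socket path the daemon was started with.
        cmd += ["-p", uds_path]
    result = subprocess.run(
        cmd, input=request_json, capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


# Building a request needs no daemon; run_sec(request) would send it.
request = build_request("get_kb_version", {})
print(request)
```

Keeping request construction separate from process invocation makes the JSON easy to inspect before anything is sent to the daemon.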
        

2.3 Script sec_api.py

Script sec_api.py is very similar to sec.py and reuses part of it. Unlike sec.py, a single instance accepts multiple requests on standard input; the answer to each request is printed to standard output. On launch it also starts an HTTP server listening on port 8082 (by default). Any HTTP client can send a SEC request for a specific service through this script via an HTTP POST request and receive the response.

2.3.1 Usage

 ./sec_api.py [-h] [-t [DIRECTORY]] [-p PATH] [-n PORT]
 Optional arguments:
   -h, --help            Shows help and then terminates.
   -t [DIRECTORY], --testing_mode [DIRECTORY]
                         Switches to test mode, which allows testing work
                         with structured annotations that NER does not know.
                         In this mode, service "annotate" looks in the query
                         part of URI "DOCUMENT_URI" for the value of key "tid"
                         and, according to it, looks in directory DIRECTORY
                         for a file with the answer for "annotation_format".
                         If such a file is found, its content is returned
                         instead of the results from NER. The query of URI
                         "DOCUMENT_URI" can also contain key "aid". Unlike
                         with key "tid", the content of the file found for
                         this value only enriches the result of NER (it is
                         appended to it). Default value is ./testing_mode
                         relative to script's directory.
   -p PATH, --uds_path PATH
                         Sets path to Unix domain socket, where daemon is 
                         waiting for clients. Default value is ./daemon_uds 
                         relative to script's directory.
   -n PORT, --net_port PORT
                         Sets port, where SEC is waiting for clients. Default value
                         is 8082.
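
A request to the HTTP interface might be prepared as in the following sketch. The host "localhost", the URL path "/" and the Content-Type header are assumptions, not taken from the SEC protocol documentation; only the use of POST and the default port 8082 come from the description above.

```python
import json
import urllib.request


def build_http_request(payload, host="localhost", port=8082, path="/"):
    """Build (but do not send) an HTTP POST carrying a SEC request.

    The path "/" and the Content-Type header are assumptions.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_http_request({"get_enrichment_engines": {}})
print(req.get_method(), req.full_url)

# To send it against a running sec_api.py:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       answer = json.loads(resp.read().decode("utf-8"))
```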

        

2.4 Internal communication protocol

The internal communication protocol is based on the client-server model and uses a Unix domain socket (UDS) in stream mode.


Key points include:

2.4.1 Procedure

  1. Server is waiting for clients.
  2. Client connects.
  3. Client sends the settings to server (directory for the test mode and JSON with the settings of the required service).
  4. Server receives the settings and sends client a confirmation.
  5. Client sends data to be processed to server (if the required service does not need any data, the data may be empty).
  6. Server can send client a request (within the test mode) to open several files and send their file descriptors (this also demonstrates the client's permission to open the files).
  7. Server sends processed data to client.
  8. Client closes the connection or continues from step 5 or step 3.

If the server detects incorrect settings or an error occurs during processing, the error information is sent to the client and the connection is terminated.

2.4.2 Commands and packet structure

The commands are defined in file daemon_lib.py. Each has a dynamically generated two-digit Opcode.

Two packet structures are used for commands. For errors it is:

  2 bytes        String      2 bytes
  -----------------------------------
 | Opcode |  Error message  |  CRLF  |
  -----------------------------------
        

For the rest (except for file descriptors) it is the following structure, which is repeated until the number of bytes N is equal to zero:

  2 bytes        Number (decimal)        2 bytes    N bytes    2 bytes
  ---------------------------------------------------------------------
 | Opcode |  Number of bytes of data N  |  CRLF  |  Raw data  |  CRLF  |
  ---------------------------------------------------------------------
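
The two packet layouts above can be illustrated with a small framing sketch. It assumes that the Opcode is sent as two ASCII digits and that the terminating packet has the same layout with N equal to 0 and empty data; the opcode values used here are placeholders, since the real ones are generated in daemon_lib.py.

```python
CRLF = b"\r\n"


def encode_data(opcode, data, chunk_size=4096):
    """Frame data as repeated packets, ending with a zero-length packet."""
    if len(opcode) != 2:
        raise ValueError("opcode must be exactly 2 bytes")
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        out += opcode + str(len(chunk)).encode("ascii") + CRLF + chunk + CRLF
    out += opcode + b"0" + CRLF + CRLF  # terminator: N == 0, empty data
    return bytes(out)


def decode_data(stream):
    """Parse bytes produced by encode_data; return (opcode, payload)."""
    pos = 0
    payload = bytearray()
    while True:
        opcode = stream[pos:pos + 2]
        eol = stream.index(CRLF, pos + 2)   # end of the decimal length field
        n = int(stream[pos + 2:eol])
        data_start = eol + 2
        payload += stream[data_start:data_start + n]
        if stream[data_start + n:data_start + n + 2] != CRLF:
            raise ValueError("missing CRLF after raw data")
        pos = data_start + n + 2
        if n == 0:
            return opcode, bytes(payload)


def encode_error(opcode, message):
    """Frame an error packet: Opcode, message, CRLF."""
    return opcode + message.encode("utf-8") + CRLF


framed = encode_data(b"03", b"hello world", chunk_size=5)
print(framed)
print(decode_data(framed))
```

Because the length is read before the raw data, the payload itself may contain CRLF sequences without confusing the decoder.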
        

The python-fdsend library is used to send file descriptors.


3 Multiple NERs support

In development; the documentation will be completed later (it contains only essential facts at the moment):

       ner_manager.appendNER("default", module_annotate.NER())
        

A similar line is added for each further NER, with a different NER name and an instance of another wrapper.

3.1 Output specification from NERs

The specification was created according to our NER and other requirements. The output from NERs is expected to follow this syntax (BNF):

 <output from NERs> ::= <origin_base>
     | <origin_base> "\t" <id>
     | <origin_base> "\t" <id> "\t" <direct_attributes>
 <origin_base> ::= <start_offset> "\t" <end_offset> "\t" <data_type> "\t" <string_between_offsets> "\t" <data>
 <data_type> ::= "kb"
     | "activity"
     | "date"
     | "interval"
     | "coref"
     | "uri"
 <data> ::= <data-kb>
     | <data-activity>
     | <data-date>
     | <data-interval>
     | <data-coref>
     | <data-uri>
 <data-kb> ::= <KB_row> | <KB_row> ";" <data-kb>
 <data-date> ::= <year> "-" <month> "-" <day>
 <data-interval> ::= <data-date> " -- " <data-date>
 <data-coref> ::= <data-kb>
 
 <direct_attributes> ::= <attribute> | <attribute> "|" <direct_attributes>
 <attribute> ::= <attribute_name> "[" <attribute_type> "]=" <attribute_value>
 <attribute_type> ::= "string" | "decimal" | "date" | "image" | "integer" | "uri" | <other_attribute_type> 
 
 <year> ::= <digit> <digit> <digit> <digit>
 <month> ::= <digit> <digit>
 <day> ::= <digit> <digit>
 <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
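
A line in this syntax could be parsed roughly as follows. This is a sketch only: the function name, the sample lines and the attribute names are hypothetical, offsets are assumed to be integers, and KB rows are kept as opaque strings because their format is not specified here.

```python
def parse_date(text):
    """Turn "YYYY-MM-DD" into a (year, month, day) tuple of integers."""
    year, month, day = text.split("-")
    return int(year), int(month), int(day)


def parse_ner_line(line):
    """Split one NER output line into its BNF fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5 or len(fields) > 7:
        raise ValueError("expected 5 to 7 tab-separated fields")
    start, end, data_type, text, data = fields[:5]
    record = {
        "start_offset": int(start),
        "end_offset": int(end),
        "data_type": data_type,
        "string": text,
        "data": data,
        "id": fields[5] if len(fields) > 5 else None,
        "attributes": [],
    }
    if data_type in ("kb", "coref"):
        record["data"] = data.split(";")      # one or more KB rows
    elif data_type == "date":
        record["data"] = parse_date(data)
    elif data_type == "interval":
        first, last = data.split(" -- ")
        record["data"] = (parse_date(first), parse_date(last))
    if len(fields) == 7:
        for attribute in fields[6].split("|"):
            head, value = attribute.split("]=", 1)
            name, attr_type = head.split("[", 1)
            record["attributes"].append((name, attr_type, value))
    return record


# Hypothetical sample line with an id and two direct attributes.
sample = "0\t7\tkb\tExample\t12;57\tn1\tname[string]=Example|rank[integer]=3"
print(parse_ner_line(sample))
```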
        

where:


4 Calling SEC as a library

To avoid creating a subprocess with sec.py when SEC is used from another program, class Sec has been created in sec.py. Its methods are described in the source code. As with sec.py, script sec_daemon.py must be running, and the class must be initialized with the path to the Unix domain socket on which the daemon is waiting. Configurations are defined similarly to sec.py; the only difference is that Python equivalents are used instead of JSON (see table).


5 Launching on grid

Script ./sge/sec.sh has been created for launching SEC on a grid (SGE).

Several requirements were placed:

The final ./sge/sec.sh accepts the same arguments as sec.py. Even though it was designed for launching on a grid, it can also be used on ordinary machines.

For this purpose, switch --own_kb_daemon was added to sec_daemon.py and --plaintext to sec.py. The KB daemon also gained the ability to change the name of its shared memory via a program argument.

5.1 Using in the manner of NER

To use SEC through stdin/stdout in the manner of NER, you can use service get_raw_annotations. It is necessary to create a configuration file (for example "get_raw_annotations.cfg"), e.g. with:

 {
     "get_raw_annotations": {}
 }
        

Then NER can be called via SEC using this command:

 ./sge/sec.sh -c get_raw_annotations.cfg --plaintext
        

6 Launching on Salomon

Scripts in directory ./salomon have been created for launching on the Salomon supercomputer (IT4I).

SEC depends on several libraries that are not installed on Salomon. They have to be copied from knot09:/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt. This can be done e.g. this way:

 $ mkdir -p ~/mnt/ssh-knot-knot09
 $ sshfs xlogin01@knot09.fit.vutbr.cz:/ ~/mnt/ssh-knot-knot09/
 $ cp -r ~/mnt/ssh-knot-knot09/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt ~/
 $ fusermount -u ~/mnt/ssh-knot-knot09
        

The dependencies are already built. If a new compilation is necessary, launch ./salomon/prepare.sh.

To launch, use one of the variants of script ./salomon/start.sh. Each variant expects:

       $ ls ~/parsed | sed 's/\.vert.*//g' > ~/namelist
        

created upon launch:

6.1 Variants

  1. Variant ./salomon/start.sh launches a separate instance of ./sge/sec.sh on one node for each file from ~/namelist.
  2. Variant ./salomon/v2/start.sh requires an argument defining the number of jobs per node. Based on this number and the number of files in ~/namelist, the necessary number of jobs is created; each of these jobs occupies all nodes available to the user for a single job under the limits.

7 Provided services

7.1 Service annotate

This service returns annotations for the specified document, using the enrichment engine chosen by the user. Annotations include information about their location in the document (start and end offset), length and the annotated text itself. They also contain information obtained from the KB, e.g. type, name and Wikipedia URL.

The enrichment engine is chosen by the user in parameter "enrichment_engine". The user can also limit its maximum processing time using the "enrichment_engine_timeout" parameter. To list all enrichment engines, use the "get_enrichment_engines" service.

Text in a document is usually ambiguous, so the enrichment engine might find multiple entities for a particular text. If parameter "disambiguate" is set, the enrichment engine selects the most probable meaning of the annotated text.

The output format can be chosen using the "annotation_format" parameter. Multiple output formats can be chosen for one input. Note: This might change later so that the output is always valid JSON when "plaintext": false. Parameter "annotation_format" can have these values:

Using the "types_and_attributes" parameter you can specify what information from the KB will be included in the output. It is possible to allow specific types with all of their attributes (syntax { str(type): "all" }) or only some of them (syntax { str(type): [ str(attribute), ... ] }). The default value is "all", which allows output of all annotation types and all of their attributes. All available types and their attributes can be listed using the "get_entity_types_and_attributes" service.

Parameter "document_uri" is used to enter the URL from which the document was taken. If the NIF output format is set, this parameter is required.

If parameter "plaintext" is set to true, the output is not encapsulated in JSON. In this case the individual output formats are separated by character '\0'.

Samples and more information can be found here.

7.1.1 Request format

 {
     "annotate": {
         "input_text": str,
         "annotation_format": [ str, ... ],
         "disambiguate": int,
         "document_uri": str,
         "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
         "enrichment_engine": str,
         "enrichment_engine_timeout": int,
         "plaintext": bool
     }
 }
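
The accepted shapes of "types_and_attributes" can be checked before a request is sent. The sketch below is not part of SEC; the type name "person" and attribute "name" are hypothetical, and it assumes that a single filter object may mix "all" and attribute lists for different types.

```python
import json


def validate_types_and_attributes(value):
    """Return True if value is "all" or a {type: "all" | [attribute, ...]} dict."""
    if value == "all":
        return True
    if not isinstance(value, dict):
        return False
    for type_name, attrs in value.items():
        if not isinstance(type_name, str):
            return False
        if attrs == "all":
            continue
        if not (isinstance(attrs, list)
                and all(isinstance(a, str) for a in attrs)):
            return False
    return True


# Hypothetical type and attribute names; list the real ones with
# the "get_entity_types_and_attributes" service.
type_filter = {"person": ["name"]}
request = {
    "annotate": {
        "input_text": "Example text.",
        "disambiguate": 1,
        "types_and_attributes": type_filter,
        "plaintext": False,
    }
}
if validate_types_and_attributes(type_filter):
    print(json.dumps(request))
```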
        

7.1.2 Answer format

 {
     "annotation": str
 }
        

7.1.3 Output format

Output format of this service can be chosen by parameter "annotation_format". These formats are described below.

7.1.3.1 SXML

XML document including annotated text only. It is designed mainly for further processing.

 <?xml version="1.0" encoding="UTF-8"?>
 <!-- Generated using: trang -I xml -O rng *.sxml SXML.rng -->
 <grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="suggestion">
      <zeroOrMore>
        <element name="text">
          <attribute name="e_offset">
            <data type="integer"/>
          </attribute>
          <attribute name="s_offset">
            <data type="integer"/>
          </attribute>
          <attribute name="string"/>
          <zeroOrMore>
            <element name="annotation">
              <optional>
                <attribute name="id">
                  <data type="anyURI"/>
                </attribute>
              </optional>
              <attribute name="type">
                <data type="NCName"/>
              </attribute>
              <zeroOrMore>
                <element name="attribute">
                  <optional>
                    <attribute name="annotType">
                      <data type="NCName"/>
                    </attribute>
                  </optional>
                  <attribute name="name">
                    <data type="NCName"/>
                  </attribute>
                  <attribute name="type">
                    <data type="NCName"/>
                  </attribute>
                  <text/>
                </element>
              </zeroOrMore>
            </element>
          </zeroOrMore>
        </element>
      </zeroOrMore>
    </element>
  </start>
 </grammar>
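
A document conforming to this schema can be read with the Python standard library, as in the sketch below. The sample document, including its type, id and attribute values, is invented for illustration and only follows the element and attribute names of the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical SXML document shaped according to the schema above.
SAMPLE_SXML = """<suggestion>
  <text s_offset="0" e_offset="7" string="Example">
    <annotation type="person" id="http://example.org/entity/1">
      <attribute name="name" type="string">Example</attribute>
    </annotation>
  </text>
</suggestion>
"""


def parse_sxml(document):
    """Return a list of annotated spans found in an SXML document."""
    root = ET.fromstring(document)
    if root.tag != "suggestion":
        raise ValueError("root element must be <suggestion>")
    spans = []
    for text in root.findall("text"):
        annotations = []
        for ann in text.findall("annotation"):
            annotations.append({
                "type": ann.get("type"),
                "id": ann.get("id"),  # optional in the schema, may be None
                "attributes": [
                    (a.get("name"), a.get("type"), a.text or "")
                    for a in ann.findall("attribute")
                ],
            })
        spans.append({
            "s_offset": int(text.get("s_offset")),
            "e_offset": int(text.get("e_offset")),
            "string": text.get("string"),
            "annotations": annotations,
        })
    return spans


for span in parse_sxml(SAMPLE_SXML):
    print(span["s_offset"], span["e_offset"], span["string"])
```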
        
7.1.3.2 XML
7.1.3.3 HTML
7.1.3.4 Text
7.1.3.5 Index
7.1.3.6 Index2
7.1.3.7 RDF
7.1.3.8 NIF

7.2 Service annotate_vertical

A special variant of service "annotate" for annotating text in vertical format; only the differences are described here. One of its parts, the "deverticalize" service, takes care of gradually extracting individual documents from the input in vertical format. The output format must be specified in the request.

7.2.1 Request format

 {
    "annotate_vertical": {
        "input_text": str,
        "annotation_format": str,
        "vert_in_cols": [ str, ... ],
        "vert_out_cols": [ str, ... ],
        "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
        "enrichment_engine": str,
        "enrichment_engine_timeout": int,
        "filename": str,
        "num_workers": int,
        "plaintext": bool,
        "max_values_per_col": int | null,
        "wiki_mode": bool,
        "enable_figa": bool
    }
 }
        

7.2.2 Answer format

 {
     "annotation": str | [
         {
             "title": str,
             "uri": str,
             "article": str
         },
         ...
     ]
 }
        

If "annotation_format" is "elasticsearch", the value of key "annotation" is a list of objects; otherwise it is a string.

7.3 Service deverticalize

Deverticalizes text in vertical format (see https://www.sketchengine.co.uk/documentation/preparing-corpus-text/ or http://nlp.fi.muni.cz/cs/PopisVertikalu).

7.3.1 Request format

 {
     "deverticalize": {
         "input_text": str,
         "vert_in_cols": [ str, ... ]
     }
 }
        

7.3.2 Answer format

 {
     "deverticalized": [
         {
             "id": str,
             "document": str
         },
         ...
     ]
 }
        

7.3.3 Errors

7.4 Service get_enrichment_engines

Lists all available enrichment engines, i.e. the values that attribute "enrichment_engine" (used by some services) can have.

7.4.1 Request format

 {
     "get_enrichment_engines": {}
 }
        

7.4.2 Answer format

 {
     "enrichment_engines": [ str, ... ]
 }
        

7.5 Service get_entities

Returns every entity from the KB whose name is the same as or similar to the name specified in attribute "input_string". The output is ordered by the value of the entity attribute "confidence". As in service "annotate", the output can be filtered using the "types_and_attributes" attribute.

Samples and more information can be found here.

7.5.1 Request format

 {
     "get_entities": {
         "input_string": str,
         "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
         "max_results": int
     }
 }
        

7.5.2 Answer format

 {
     "data": [
         {
             str(type): {
                 str(attribute): str,
                 ...
             }
         },
         ...
     ]
 }
        

7.6 Service get_entity_by_uri

Search for entities by URI.

7.6.1 Request format

 {
     "get_entity_by_uri": {
         "input_string": str,
         "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] }
     }
 }
        

7.6.2 Answer format

 {
     "data": [
         {
             str(type): {
                 str(attribute): str,
                 ...
             }
         },
         ...
     ]
 }
        

7.7 Service get_entity_types_and_attributes

Lists all the available types and their attributes. This information can be used in attribute "types_and_attributes", which serves as a filter in certain services.

7.7.1 Request format

 {
     "get_entity_types_and_attributes": {}
 }
        

7.7.2 Answer format

 {
     "data": [
         {
             "type": str(type),
             "attributes": [
                 str(attribute),
                 ...
             ]
         },
         ...
     ]
 }
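
An answer in this format can be turned directly into a "types_and_attributes" filter. The following sketch assumes the answer has already been parsed from JSON; the type and attribute names in the sample are hypothetical.

```python
def answer_to_filter(answer, wanted_types=None):
    """Build a {type: [attribute, ...]} filter from the service answer.

    If wanted_types is given, only those types are kept.
    """
    result = {}
    for entry in answer["data"]:
        if wanted_types is None or entry["type"] in wanted_types:
            result[entry["type"]] = list(entry["attributes"])
    return result


# Hypothetical answer of "get_entity_types_and_attributes".
sample_answer = {
    "data": [
        {"type": "person", "attributes": ["name", "date_of_birth"]},
        {"type": "location", "attributes": ["name"]},
    ]
}
print(answer_to_filter(sample_answer, wanted_types={"person"}))
```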
        

7.8 Service get_kb_version

Returns the version number of loaded KB.

7.8.1 Request format

 {
     "get_kb_version": {}
 }
        

7.8.2 Answer format

 {
     "version": int
 }
        

7.9 Service get_raw_annotations

Returns the raw string produced by NER.

7.9.1 Request format

 {
     "get_raw_annotations": {
         "input_text": str,
         "disambiguate": int,
         "enrichment_engine": str,
         "enrichment_engine_timeout": int,
         "plaintext": bool
     }
 }
        

7.9.2 Answer format

 {
     "annotation": str
 }
        

8 Some of the generated attributes for types

"confidence"

"identifier"

"disambiguation"

 <disambiguation> ::= <text in URI between brackets> "," <description> "(" <interval of living> ")"
                    | <text in URI between brackets> "(" <interval of living> ")"
                    | <text in URI between brackets>
                    | <description> "(" <interval of living> ")"
                    | <description>
                    | <interval of living>
                    | ""

 <interval of living> ::= <YYYY-MM-DD date of birth> " -- " <YYYY-MM-DD date of death>
                   | "born " <YYYY-MM-DD date of birth>
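
The interval of living could be recognized with a regular expression, as sketched below. It assumes the two dates are separated by " -- " with surrounding spaces, as in the interval syntax of the NER output; the function name and sample dates are illustrative.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"
INTERVAL_RE = re.compile(
    rf"^(?:(?P<birth>{DATE}) -- (?P<death>{DATE})|born (?P<born>{DATE}))$"
)


def parse_interval_of_living(text):
    """Return {"birth": ..., "death": ...} or None if text does not match."""
    match = INTERVAL_RE.match(text)
    if match is None:
        return None
    if match.group("born"):
        # Only the date of birth is known.
        return {"birth": match.group("born"), "death": None}
    return {"birth": match.group("birth"), "death": match.group("death")}


print(parse_interval_of_living("1900-01-02 -- 1980-03-04"))
print(parse_interval_of_living("born 1950-06-07"))
```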
        
