Semantic Enrichment Component (SEC) is part of the 4A system and connects several of its parts. It provides semantic text enrichment services, for example text annotation (annotate) and listing of all types and attributes in the KB (get_entity_types_and_attributes).
Service is publicly available at http://sec.fit.vutbr.cz/ on port 8082 (Protocol Documentation).
The current version is available on git in branch D114-SEC_API:
git clone http://sec.fit.vutbr.cz/sec/secapi.git secapi && cd secapi
git checkout -b D114-SEC_API origin/D114-SEC_API
This creates the secapi directory and moves into it. Next, download the KB in the ./NER directory and then move to ./SEC_API, where the necessary programs are compiled using the make command:
(cd ./NER && ./deleteKB.sh && ./downloadKB.sh)
(cd ./SEC_API && make)
Note that when the script downloadKB.sh is used, the KB and the automata (*.fsa) must not already be present in the directories secapi/NER and secapi/NER/figa. It is advised to delete them first using the deleteKB.sh script. Beware: the scripts downloadKB.sh and deleteKB.sh must be launched from the directory in which they are located (that is, ./NER)!
SEC is located in the directory ./SEC_API (this is the working path; relative paths are derived from it). Its scripts are written so that they can be called from any directory. At the moment SEC is divided into the scripts sec_daemon.py, sec.py and sec_api.py.
sec_daemon.py
The script sec_daemon.py is the core of SEC. It was created to reduce memory demands when running sec.py in parallel. Launching this script creates a Unix domain socket (UDS) that waits for connections from several instances of the scripts sec.py or sec_api.py. Instances of these two scripts communicate with sec_daemon.py using the internal communication protocol described below.
./sec_daemon.py [-h] [-p PATH] [--own_kb_daemon]
Optional arguments:
-h, --help
    Shows help and then terminates.
-p PATH, --uds_path PATH
    Sets the path to the Unix domain socket where the daemon waits for clients. The default value is ./daemon_uds relative to the script's directory.
--own_kb_daemon
    Launches its own KB daemon even if another one is already running.
sec.py
The script sec.py is a client of the daemon sec_daemon.py. It is used to present the SEC services to the user. A request in JSON is expected on standard input, and the answer is written to standard output. A description of the services and requests, with examples, can be found in ./doc/sec_api.pdf after compilation with the make command.
./sec.py [-h] [-t [DIRECTORY]] [-p PATH] [-c CONFIG.json] [--plaintext] [-f FILENAME]
Optional arguments:
-h, --help
    Shows help and then terminates.
-t [DIRECTORY], --testing_mode [DIRECTORY]
    Switches to test mode, which makes it possible to check work with structured annotations that NER is not familiar with. In this mode the service "annotate" looks in the URI query of "DOCUMENT_URI" for the value of the key "tid" and, according to it, looks in the directory DIRECTORY for a file with the answer for "annotation_format". If such a file is found, its content is returned instead of the results from NER. The URI query of "DOCUMENT_URI" can also contain the key "aid". Unlike with "tid", the content of the file found according to this value does not replace the result of NER; the result is only enriched with it (the content is attached to it). The default value is ./testing_mode relative to the script's directory.
-p PATH, --uds_path PATH
    Sets the path to the Unix domain socket where the daemon waits for clients. The default value is ./daemon_uds relative to the script's directory.
-c CONFIG.json, --config_file CONFIG.json
    Sets the service and its parameters from a JSON file instead of standard input. In this case only the text to be processed, or nothing, is expected on standard input.
--plaintext
    The output of the services "annotate", "annotate_vertical" and "get_raw_annotations" is plain text. If an exception occurs, the output remains in JSON.
-f FILENAME, --filename FILENAME
    Sets the filename for the service "annotate_vertical".
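As a rough illustration of the lookup described for the test mode, the following sketch extracts the "tid" and "aid" keys from the query part of a document URI. The function name and the sample URI are hypothetical, and how SEC maps these values to file names inside DIRECTORY is not covered here.

```python
from urllib.parse import urlparse, parse_qs


def extract_test_keys(document_uri):
    """Return the first "tid" and "aid" values from the URI query (or None)."""
    query = parse_qs(urlparse(document_uri).query)
    # parse_qs maps each key to a list of values; take the first one if present.
    tid = query.get("tid", [None])[0]
    aid = query.get("aid", [None])[0]
    return {"tid": tid, "aid": aid}


# Hypothetical document URI carrying both keys.
keys = extract_test_keys("http://example.com/doc.html?tid=12&aid=7")
print(keys)
```

A URI without a query part simply yields None for both keys, which corresponds to the case where the normal NER result would be used.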
sec_api.py
The script sec_api.py is very similar to sec.py and therefore reuses part of it. Unlike sec.py, it accepts more than one request on standard input per instance; the answer to each request is printed to standard output. On launch it also starts an HTTP server, which listens on port 8082 by default. Any HTTP client can send a request for a specific SEC service through this script using an HTTP POST request and get a response.
./sec_api.py [-h] [-t [DIRECTORY]] [-p PATH] [-n PORT]
Optional arguments:
-h, --help
    Shows help and then terminates.
-t [DIRECTORY], --testing_mode [DIRECTORY]
    Switches to test mode, which makes it possible to check work with structured annotations that NER is not familiar with. In this mode the service "annotate" looks in the URI query of "DOCUMENT_URI" for the value of the key "tid" and, according to it, looks in the directory DIRECTORY for a file with the answer for "annotation_format". If such a file is found, its content is returned instead of the results from NER. The URI query of "DOCUMENT_URI" can also contain the key "aid". Unlike with "tid", the content of the file found according to this value does not replace the result of NER; the result is only enriched with it (the content is attached to it). The default value is ./testing_mode relative to the script's directory.
-p PATH, --uds_path PATH
    Sets the path to the Unix domain socket where the daemon waits for clients. The default value is ./daemon_uds relative to the script's directory.
-n PORT, --net_port PORT
    Sets the port on which SEC waits for clients. The default value is 8082.
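A minimal sketch of how an HTTP client might prepare such a POST request with the Python standard library. The URL path "/", the Content-Type header and the helper name are assumptions; consult the protocol documentation for the exact values the server expects.

```python
import json
import urllib.request


def build_sec_request(payload, host="localhost", port=8082, path="/"):
    """Prepare (but do not send) an HTTP POST carrying a SEC request as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}{path}",  # path "/" is an assumption
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


request = build_sec_request({"get_kb_version": {}})
print(request.get_method(), request.full_url)

# To actually send it (requires a running sec_api.py):
# with urllib.request.urlopen(request, timeout=10) as response:
#     answer = json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the example usable without a running server.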
The internal communication protocol is based on the client-server model and uses a Unix domain socket (UDS) in stream mode.
The commands are listed in the file daemon_lib.py. Each has a dynamically generated two-digit Opcode.
Two packet structures are used for commands. For errors it is:
 2 bytes      String       2 bytes
-----------------------------------
| Opcode | Error message |  CRLF  |
-----------------------------------
For the rest (except for file descriptors), the following structure is repeated until the number of bytes is equal to zero:
 2 bytes       Number (decimal)       2 bytes   N bytes   2 bytes
-------------------------------------------------------------------
| Opcode | Number of bytes of data N |  CRLF  | Raw data |  CRLF  |
-------------------------------------------------------------------
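The data packet layout above could be framed and parsed roughly as follows. This is a sketch under stated assumptions, not the code from daemon_lib.py: the Opcode is taken to be two ASCII digits, the length is written as ASCII decimal, and the sequence is assumed to end with a packet that announces zero bytes followed by an empty data part. The function names and the chunk size are illustrative only.

```python
CRLF = b"\r\n"


def encode_data(opcode, data, chunk_size=4096):
    """Frame raw bytes as repeated packets, ending with a zero-length packet."""
    if len(opcode) != 2:
        raise ValueError("opcode must be exactly 2 bytes")
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        out += opcode + str(len(chunk)).encode("ascii") + CRLF + chunk + CRLF
    # Assumed terminator: a packet announcing zero bytes of data.
    out += opcode + b"0" + CRLF + CRLF
    return bytes(out)


def decode_data(buffer):
    """Parse packets produced by encode_data; return (opcode, payload)."""
    pos, payload = 0, bytearray()
    while True:
        opcode = buffer[pos:pos + 2]
        line_end = buffer.index(CRLF, pos + 2)
        size = int(buffer[pos + 2:line_end].decode("ascii"))
        data_start = line_end + 2
        payload += buffer[data_start:data_start + size]
        pos = data_start + size + 2  # skip the CRLF that follows the data
        if size == 0:
            return opcode, bytes(payload)


framed = encode_data(b"07", b"hello\r\nworld", chunk_size=5)
decoded = decode_data(framed)
```

Because each packet states its length up front, the raw data may itself contain CRLF without confusing the parser.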
The python-fdsend library is used to send file descriptors.
In development - documentation will be completed later (contains only essential facts at the moment):
ner_manager.appendNER("default", module_annotate.NER())
Similar line with another name of NER and instance of another wrap
The specification is based on our NER and other requirements. Output from NERs is expected to follow this syntax (BNF):
<output from NERs> ::= <origin_base> | <origin_base> "\t" <id> | <origin_base> "\t" <id> "\t" <direct_attributes>
<origin_base> ::= <start_offset> "\t" <end_offset> "\t" <data_type> "\t" <string_between_offsets> "\t" <data>
<data_type> ::= "kb" | "activity" | "date" | "interval" | "coref" | "uri"
<data> ::= <data-kb> | <data-activity> | <data-date> | <data-interval> | <data-coref> | <data-uri>
<data-kb> ::= <KB_row> | <KB_row> ";" <data-kb>
<data-date> ::= <year> "-" <month> "-" <day>
<data-interval> ::= <data-date> " -- " <data-date>
<data-coref> ::= <data-kb>
<direct_attributes> ::= <attribute> | <attribute> "|" <direct_attributes>
<attribute> ::= <attribute_name> "[" <attribute_type> "]=" <attribute_value>
<attribute_type> ::= "string" | "decimal" | "date" | "image" | "integer" | "uri" | <other_attribute_type>
<year> ::= <digit> <digit> <digit> <digit>
<month> ::= <digit> <digit>
<day> ::= <digit> <digit>
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
where:
So that another program using SEC does not need to create a subprocess with sec.py, the class Sec has been created in sec.py. Its methods are described in the source code. As with sec.py, the script sec_daemon.py must be running, and the class must be initialized with the path to the Unix domain socket where the daemon is waiting. Configurations are defined similarly to sec.py; the only difference is that the Python equivalent is used instead of JSON (see table).
The script ./sge/sec.sh has been created for launching SEC on a grid (SGE).
Several requirements were set:
sec.py, there will be sec_daemon.py and shared KB).
The final ./sge/sec.sh accepts the same arguments as sec.py. Even though it was designed for launching on a grid, it can also be used on ordinary machines (to be sure).
For this purpose, the switch --own_kb_daemon was added to sec_daemon.py and --plaintext to sec.py. The KB daemon also gained the ability to change the name of the shared memory through a program argument.
To use SEC with the stdin/stdout of NER, you can use the service get_raw_annotations. It is necessary to create a configuration file (for example "get_raw_annotations.cfg"), e.g. with:
{ "get_raw_annotations": {} }
Then NER can be called via SEC using this command:
./sge/sec.sh -c get_raw_annotations.cfg --plaintext
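The two steps above (writing the configuration file and assembling the command line) can be sketched in Python as follows. The helper names are hypothetical, and the commented subprocess call assumes the text to annotate is passed on standard input, as described for the -c switch of sec.py.

```python
import json
import os
import tempfile


def write_config(directory, service="get_raw_annotations", params=None):
    """Write a one-service SEC configuration file and return its path."""
    path = os.path.join(directory, f"{service}.cfg")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({service: params or {}}, handle)
    return path


def build_command(config_path, script="./sge/sec.sh"):
    """Return the argument list matching the command line shown above."""
    return [script, "-c", config_path, "--plaintext"]


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(tmp)
    command = build_command(config_path)
    with open(config_path, encoding="utf-8") as handle:
        written = json.load(handle)

# With SEC installed, the text to annotate would be passed on stdin, e.g.:
# subprocess.run(command, input=text, capture_output=True, text=True)
```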
Scripts in the directory ./salomon have been created for launching on the Salomon supercomputer (IT4I).
SEC depends on several libraries that are not installed on Salomon. They have to be copied from knot09:/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt. This can be done e.g. this way:
$ mkdir -p ~/mnt/ssh-knot-knot09
$ sshfs xlogin01@knot09.fit.vutbr.cz:/ ~/mnt/ssh-knot-knot09/
$ cp -r ~/mnt/ssh-knot-knot09/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt ~/
$ fusermount -u ~/mnt/ssh-knot-knot09
The dependencies are already built. If a new compilation is necessary, launch ./salomon/prepare.sh.
To launch, use one of the several variants of the script ./salomon/start.sh. Each variant expects:
$ ls ~/parsed | sed 's/\.vert.*//g' > ~/namelist
created upon launch:
annotate
This service returns annotations for the specified document, using the enrichment engine chosen by the user. Annotations include information about their location in the document (start and end offset), length and the annotated text itself. They also contain information obtained from the KB, e.g. type, name and Wikipedia URL.
The enrichment engine is chosen by the user in the parameter "enrichment_engine". The user can also set its maximum processing time using the "enrichment_engine_timeout" parameter. All enrichment engines can be listed with the "get_enrichment_engines" service.
Text in a document is usually ambiguous, so the enrichment engine might find multiple entities for a particular piece of text. If the parameter "disambiguate" is set, the enrichment engine selects the most probable meaning of the annotated text.
The output format can be chosen using the "annotation_format" parameter. Multiple output formats can be chosen for one input. Note: This might be changed later so that the output is always correct JSON when "plaintext": false. The parameter "annotation_format" can have these values:
"disambiguate": 0 will cause an error.
Using the "types_and_attributes" parameter you can specify which information from the KB is included in the output. It is possible to allow specific types with all of their attributes (syntax { str(type): "all" }) or with only some of them (syntax { str(type): [ str(attribute), ... ] }). The default value is "all", which allows all annotation types and all of their attributes. All available types and their attributes can be listed using the "get_entity_types_and_attributes" service.
The parameter "document_uri" specifies the URL from which the document was taken. It is required if the NIF output format is set.
If the parameter "plaintext" is set to true, the output is not encapsulated in JSON. In this case the individual output formats are separated by the character '\0'.
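A small sketch of how a client might split such plaintext output into its parts. It assumes the parts appear in the same order as the requested formats and that there is no trailing '\0'; neither is stated here, and the format names below are placeholders rather than real "annotation_format" values.

```python
def split_plaintext_output(raw_output, annotation_formats):
    """Pair each requested format with its part of the plaintext output."""
    parts = raw_output.split("\0")
    # Assumption: parts appear in the same order as the requested formats.
    if len(parts) != len(annotation_formats):
        raise ValueError("number of parts does not match number of formats")
    return dict(zip(annotation_formats, parts))


# Placeholder format names and contents, for illustration only.
sample = "<first output>\0<second output>"
result = split_plaintext_output(sample, ["format_a", "format_b"])
```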
Samples and more information can be found here.
{
  "annotate": {
    "input_text": str,
    "annotation_format": [ str, ... ],
    "disambiguate": int,
    "document_uri": str,
    "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
    "enrichment_engine": str,
    "enrichment_engine_timeout": int,
    "plaintext": bool
  }
}
Optional attributes: ["disambiguate", "types_and_attributes", "document_uri", "enrichment_engine", "enrichment_engine_timeout", "plaintext"]. "document_uri" is required if the values of attribute "annotation_format" include "NIF".
{ "annotation": str }
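The request structure and the "document_uri" rule can be checked on the client side before sending. The helper below is a sketch, not part of SEC: its name, the sample text and the sample URL are made up for illustration.

```python
import json

OPTIONAL_KEYS = {
    "disambiguate", "types_and_attributes", "document_uri",
    "enrichment_engine", "enrichment_engine_timeout", "plaintext",
}


def build_annotate_request(input_text, annotation_format, **optional):
    """Build an "annotate" request dict, checking the documented constraints."""
    unknown = set(optional) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    if "NIF" in annotation_format and "document_uri" not in optional:
        raise ValueError('"document_uri" is required for the NIF format')
    body = {"input_text": input_text, "annotation_format": list(annotation_format)}
    body.update(optional)
    return {"annotate": body}


request = build_annotate_request(
    "Vincent van Gogh was born in Zundert.",
    ["NIF"],
    document_uri="http://example.com/article",  # hypothetical URL
    disambiguate=1,
)
print(json.dumps(request, indent=2))
```

The resulting JSON can be passed to sec.py on standard input or sent to sec_api.py.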
The output format of this service can be chosen with the parameter "annotation_format". These formats are described below.
XML document including annotated text only. It is designed mainly for further processing.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated using: trang -I xml -O rng *.sxml SXML.rng -->
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="suggestion">
      <zeroOrMore>
        <element name="text">
          <attribute name="e_offset">
            <data type="integer"/>
          </attribute>
          <attribute name="s_offset">
            <data type="integer"/>
          </attribute>
          <attribute name="string"/>
          <zeroOrMore>
            <element name="annotation">
              <optional>
                <attribute name="id">
                  <data type="anyURI"/>
                </attribute>
              </optional>
              <attribute name="type">
                <data type="NCName"/>
              </attribute>
              <zeroOrMore>
                <element name="attribute">
                  <optional>
                    <attribute name="annotType">
                      <data type="NCName"/>
                    </attribute>
                  </optional>
                  <attribute name="name">
                    <data type="NCName"/>
                  </attribute>
                  <attribute name="type">
                    <data type="NCName"/>
                  </attribute>
                  <text/>
                </element>
              </zeroOrMore>
            </element>
          </zeroOrMore>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
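A document conforming to this schema could be read with the Python standard library roughly as follows. The sample document is invented to match the schema; the type "person", the attribute "name" and the identifier URL are placeholders, not values guaranteed to occur in the KB.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<suggestion>
  <text s_offset="0" e_offset="16" string="Vincent van Gogh">
    <annotation type="person" id="http://example.com/kb/1">
      <attribute name="name" type="string">Vincent van Gogh</attribute>
    </annotation>
  </text>
</suggestion>"""


def parse_suggestions(xml_bytes):
    """Return one dict per <text> element with its offsets and annotations."""
    root = ET.fromstring(xml_bytes)
    results = []
    for text in root.findall("text"):
        annotations = []
        for annotation in text.findall("annotation"):
            attributes = {
                attr.get("name"): attr.text or ""
                for attr in annotation.findall("attribute")
            }
            annotations.append({
                "type": annotation.get("type"),
                "id": annotation.get("id"),  # optional in the schema
                "attributes": attributes,
            })
        results.append({
            "start": int(text.get("s_offset")),
            "end": int(text.get("e_offset")),
            "string": text.get("string"),
            "annotations": annotations,
        })
    return results


parsed = parse_suggestions(SAMPLE.encode("utf-8"))
```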
annotate_vertical
A special variant of the service "annotate" for annotating verticals. (For this reason, only the differences are described.) One of its parts, the "deverticalize" service, takes care of gradually obtaining the individual documents from the input in vertical format. The output format must be specified in the request.
{
  "annotate_vertical": {
    "input_text": str,
    "annotation_format": str,
    "vert_in_cols": [ str, ... ],
    "vert_out_cols": [ str, ... ],
    "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
    "enrichment_engine": str,
    "enrichment_engine_timeout": int,
    "filename": str,
    "num_workers": int,
    "plaintext": bool,
    "max_values_per_col": int | null,
    "wiki_mode": bool,
    "enable_figa": bool
  }
}
Optional attributes: ["vert_in_cols", "vert_out_cols", "types_and_attributes", "enrichment_engine", "enrichment_engine_timeout", "filename", "num_workers", "plaintext"].
"annotation_format" can have the values ["mg4j", "manatee", "manatee2", "elasticsearch"].
"vert_in_cols" and "vert_out_cols" determine the meaning of each individual column of the vertical on input and on output, respectively. If not set, the meaning is the standard one ⇒ the first 13 MG4J columns. The previously used attribute "vert_cols" is still functional, but it is not used anymore. The names of the columns are listed in the header of the MG4J format.
"filename" sets the name of the source file. It is used in the output format MG4J.
"num_workers" sets the number of documents from the vertical that are processed in parallel. The default value is the number of processes.
"max_values_per_col" sets the maximum number of values per column. It is used in the output format MG4J. The default value is 4.
"wiki_mode" allows the column "docuri" to be used to process verticals from Wikipedia. It works like this:
"enable_figa" == true, then it looks for an entity in the KB according to the URL in the hypertext link:
"wiki_mode": false and "enable_figa": true.
{ "annotation": str | [ { "title": str, "uri": str, "article": str }, ... ] }
If "annotation_format" is "elasticsearch", the value of the key "annotation" is a list of objects; otherwise it is a string.
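Since the value of "annotation" has two possible shapes, client code may want to handle both in one place. The generator below is one possible sketch; its name and the sample responses are illustrative only.

```python
def iter_documents(response):
    """Yield (title, uri, article) tuples regardless of the response shape."""
    annotation = response["annotation"]
    if isinstance(annotation, str):
        # Non-elasticsearch formats: a single string with no title or URI.
        yield (None, None, annotation)
    else:
        for item in annotation:
            yield (item["title"], item["uri"], item["article"])


# Invented sample responses for the two cases.
as_string = {"annotation": "annotated vertical as text"}
as_list = {"annotation": [
    {"title": "Doc 1", "uri": "http://example.com/1", "article": "text 1"},
]}
string_docs = list(iter_documents(as_string))
list_docs = list(iter_documents(as_list))
```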
deverticalize
Deverticalizes text in vertical format (see https://www.sketchengine.co.uk/documentation/preparing-corpus-text/ or http://nlp.fi.muni.cz/cs/PopisVertikalu).
{ "deverticalize": { "input_text": str, "vert_in_cols": [ str, ... ] } }
Optional attributes: ["vert_in_cols"].
"vert_in_cols" determines the meaning of each individual column in the input vertical. If not set, the meaning is the standard one ⇒ the first 13 MG4J columns. The previously used attribute "vert_cols" is still functional, but it is not used anymore.
{ "deverticalized": [ { "id": str, "document": str }, ... ] }
<s></s> and with tags <g/> then the tag <g/> will not stick together (e.g.: "<s>\nHello\n</g>\n!</s><s></g>\n!</s><s></g>\n!</s>" ⇒ "Hello!\n!\n!\n") because of <s>.
get_enrichment_engines
Lists all available enrichment engines, i.e. the values that the attribute "enrichment_engine" (used by some services) can have.
{ "get_enrichment_engines": {} }
{ "enrichment_engines": [ str, ... ] }
get_entities
Prints each entity from the KB whose name is the same as or similar to the name specified in the attribute "input_string". The output is ordered by the value of the entity attribute "confidence". As in the service "annotate", results can be filtered using the "types_and_attributes" attribute.
Samples and more information can be found here.
{
  "get_entities": {
    "input_string": str,
    "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
    "max_results": int
  }
}
Optional attributes: ["max_results"].
"input_string" sets the name of the wanted entity, or the initial part of the name ending with an asterisk '*'.
"types_and_attributes" filters in exactly the same way as in the service "annotate".
"max_results" sets the maximum number of entities in the output. The default value is 10.
{ "data": [ { str(type): { str(attribute): str, ... } }, ... ] }
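As a sketch of how a client could use this service, the helpers below build a request (optionally as a prefix search with the trailing asterisk) and pick entities of one type out of a response. The helper names, the type names and the sample response are invented for illustration.

```python
def build_get_entities_request(name, prefix=False, max_results=10,
                               types_and_attributes="all"):
    """Build a "get_entities" request; prefix=True appends the asterisk."""
    input_string = name + "*" if prefix else name
    return {
        "get_entities": {
            "input_string": input_string,
            "types_and_attributes": types_and_attributes,
            "max_results": max_results,
        }
    }


def entities_of_type(response, wanted_type):
    """Collect attribute dicts of all returned entities of one type."""
    return [
        attributes
        for entity in response["data"]
        for entity_type, attributes in entity.items()
        if entity_type == wanted_type
    ]


request = build_get_entities_request("Vincent van", prefix=True, max_results=5)

# Invented response following the documented output structure.
sample_response = {"data": [
    {"person": {"name": "Vincent van Gogh"}},
    {"location": {"name": "Vincent"}},
]}
people = entities_of_type(sample_response, "person")
```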
get_entity_by_uri
Search for entities by URI.
{
  "get_entity_by_uri": {
    "input_string": str,
    "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] }
  }
}
{ "data": [ { str(type): { str(attribute): str, ... } }, ... ] }
get_entity_types_and_attributes
Lists all available types and their attributes. This information can be used in the attribute "types_and_attributes", which serves as a filter for certain services.
{ "get_entity_types_and_attributes": {} }
{ "data": [ { "type": str(type), "attributes": [ str(attribute), ... ] }, ... ] }
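The output of this service can be turned into a validated "types_and_attributes" filter. The function below is a sketch: its name and the sample type and attribute names are hypothetical, and it assumes that one filter dictionary may hold several types at once, which the syntax shown for the filter does not state explicitly.

```python
def build_filter(response, wanted):
    """Build a "types_and_attributes" value from the service output.

    wanted maps a type name to "all" or to a list of attribute names.
    """
    available = {item["type"]: set(item["attributes"]) for item in response["data"]}
    result = {}
    for type_name, attributes in wanted.items():
        if type_name not in available:
            raise KeyError(f"unknown type: {type_name}")
        if attributes == "all":
            result[type_name] = "all"
            continue
        missing = set(attributes) - available[type_name]
        if missing:
            raise KeyError(f"unknown attributes for {type_name}: {sorted(missing)}")
        result[type_name] = list(attributes)
    return result


# Invented output of "get_entity_types_and_attributes".
types_response = {"data": [
    {"type": "person", "attributes": ["name", "date_of_birth"]},
    {"type": "location", "attributes": ["name", "country"]},
]}
entity_filter = build_filter(types_response, {"person": ["name"], "location": "all"})
```

Validating the filter against the listed types catches misspelled names before a request is sent.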
get_kb_version
Returns the version number of loaded KB.
{ "get_kb_version": {} }
{ "version": int }
get_raw_annotations
Returns string obtained by NER.
{
  "get_raw_annotations": {
    "input_text": str,
    "disambiguate": int,
    "enrichment_engine": str,
    "enrichment_engine_timeout": int,
    "plaintext": bool
  }
}
Optional attributes: ["disambiguate", "enrichment_engine", "enrichment_engine_timeout", "plaintext"].
{ "annotation": str }
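Assuming the string returned by this service follows the tab-separated NER output syntax (BNF) given earlier in this document, one line of it could be parsed as sketched below. The function name, the sample line and the attribute names are invented; offsets are assumed to be integers, and attribute values containing "|" are not handled.

```python
import re

# <attribute> ::= <attribute_name> "[" <attribute_type> "]=" <attribute_value>
ATTRIBUTE_RE = re.compile(
    r"^(?P<name>[^\[]+)\[(?P<type>[^\]]+)\]=(?P<value>.*)$", re.S
)


def parse_ner_line(line):
    """Split one NER output line into its fields according to the BNF."""
    fields = line.rstrip("\n").split("\t")
    if not 5 <= len(fields) <= 7:
        raise ValueError(f"expected 5 to 7 tab-separated fields, got {len(fields)}")
    record = {
        "start_offset": int(fields[0]),
        "end_offset": int(fields[1]),
        "data_type": fields[2],
        "string": fields[3],
        "data": fields[4],
        "id": fields[5] if len(fields) > 5 else None,
        "attributes": [],
    }
    if len(fields) == 7:
        # <direct_attributes> are separated by "|".
        for item in fields[6].split("|"):
            match = ATTRIBUTE_RE.match(item)
            if match is None:
                raise ValueError(f"malformed attribute: {item!r}")
            record["attributes"].append(match.groupdict())
    return record


# Invented sample line with an id and two direct attributes.
sample_line = ("0\t16\tkb\tVincent van Gogh\t42\tid1\t"
               "name[string]=Vincent van Gogh|born[date]=1853-03-30\n")
record = parse_ner_line(sample_line)
```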
"identifier"
"disambiguation"
<disambiguation> ::= <text in URI between brackets> "," <description> "(" <interval of living> ")"
                   | <text in URI between brackets> "(" <interval of living> ")"
                   | <text in URI between brackets>
                   | <description> "(" <interval of living> ")"
                   | <description>
                   | <interval of living>
                   | ""
<interval of living> ::= <YYYY-MM-DD date of birth> -- <YYYY-MM-DD date of death>
                       | "born " <YYYY-MM-DD date of birth>
["wikipedia_url", "date_of_birth", "date_of_death", "description"].
If "wikipedia_url" contains parentheses (thus ["(", ")", "%28", "%29"]), then the text between them is reproduced, with underscores replaced by spaces. Above in BNF as <text in URI between brackets>.
If "description" contains a substring starting with "is|was a|an|the", then D_DESC_MAX_WORDS words of it are taken over and "..." is added, unless the end of "description" has been reached. Otherwise, if the number of letters is smaller than D_DESC_MAX_CHARS, the whole attribute "description" is taken over. Above in BNF as <description>.
(?:^|\W)whole word(?=(?:$|\W))), then <disambiguation> has syntax without <text in URI between brackets>.
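Two of the building blocks described above, the text between brackets in the URL and the interval of living, can be sketched in Python as follows. This is an illustration under assumptions rather than the SEC implementation: the function names are invented, the separator is assumed to be " -- " with surrounding spaces, and the description rules (D_DESC_MAX_WORDS, D_DESC_MAX_CHARS) are left out.

```python
import re
from urllib.parse import unquote


def text_between_brackets(wikipedia_url):
    """Return the parenthesised part of the URL with underscores as spaces."""
    # unquote turns %28 and %29 into "(" and ")".
    match = re.search(r"\(([^()]*)\)", unquote(wikipedia_url))
    return match.group(1).replace("_", " ") if match else ""


def interval_of_living(date_of_birth, date_of_death):
    """Format the <interval of living> part of the BNF above."""
    if date_of_birth and date_of_death:
        # Assumed separator: " -- " with surrounding spaces.
        return f"{date_of_birth} -- {date_of_death}"
    if date_of_birth:
        return f"born {date_of_birth}"
    return ""


bracket_text = text_between_brackets(
    "http://en.wikipedia.org/wiki/John_Smith_%28ice_hockey%29"
)
lifetime = interval_of_living("1853-03-30", "1890-07-29")
```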