Semantic Enrichment Component

Semantic Enrichment Component (SEC) is a part of the 4A system that connects several of its other parts. It provides semantic text enrichment services, for example text annotation (service annotate) and listing of all entity types and attributes in the KB (service get_entity_types_and_attributes).

Service is publicly available at http://sec.fit.vutbr.cz/ on port 8082 (Protocol Documentation).

1 Prerequisites
2 Description of parts
3 Multiple NERs support
- 3.1 Output specification from NERs
4 Calling SEC as a library
5 Launching on grid
- 5.1 Usage in the manner of NER
6 Launching on Salomon
- 6.1 Variants
7 Provided services
8 Some of the generated attributes for types
9 References

1 Prerequisites

The current version is available on git in branch D114-SEC_API:

 git clone http://sec.fit.vutbr.cz/sec/secapi.git secapi && cd secapi
 git checkout -b D114-SEC_API origin/D114-SEC_API

This creates the secapi directory and moves into it. Next, download the KB in the ./NER directory and then build the necessary programs in ./SEC_API using the make command:

 (cd ./NER && ./deleteKB.sh && ./downloadKB.sh)
 (cd ./SEC_API && make)

Note that when the script downloadKB.sh is run, no KB or automata files (*.fsa) may already be present in the directories secapi/NER and secapi/NER/figa. It is advised to delete them first using the deleteKB.sh script. Beware: the scripts downloadKB.sh and deleteKB.sh must be launched from the directory in which they are located (that is, ./NER)!

2 Description of parts

SEC can be found in the directory ./SEC_API (this is the working path; relative paths below are derived from it). Its scripts are written so that they can be called from any directory. At the moment SEC is divided into the scripts sec_daemon.py, sec.py and sec_api.py.

2.1 Script `sec_daemon.py`

Script sec_daemon.py is the core of SEC. It was created to reduce memory demands when several instances of sec.py run in parallel. Launching this script creates a Unix domain socket (UDS) on which the daemon waits for connections from instances of the scripts sec.py or sec_api.py. Instances of these two scripts communicate with sec_daemon.py using the internal communication protocol described below.

2.1.1 Usage

 ./sec_daemon.py [-h] [-p PATH] [--own_kb_daemon]

 Optional arguments:
   -h, --help            shows help and then terminates.
   -p PATH, --uds_path PATH
                         Sets path to the Unix domain socket on which the daemon
                         waits for clients. Default value is ./daemon_uds
                         relative to the script's directory.
   --own_kb_daemon       Launches its own KB daemon even if any other is already running.

2.2 Script `sec.py`

Script sec.py is a client of the daemon sec_daemon.py. It presents the SEC services to the user. A request in JSON is expected on standard input, and the answer is written to standard output. A description of the services and requests, with examples, can be found in ./doc/sec_api.pdf after building with the make command.

2.2.1 Usage

 ./sec.py [-h] [-t [DIRECTORY]] [-p PATH] [-c CONFIG.json] [--plaintext]
          [-f FILENAME]

 Optional arguments:
   -h, --help            shows help and then terminates.
   -t [DIRECTORY], --testing_mode [DIRECTORY]
                         Switches to test mode, which allows testing of work
                         with structured annotations that NER does not know.
                         In this mode the service "annotate" looks in the URI query
                         of "DOCUMENT_URI" for the value of key "tid" and, according
                         to it, looks in directory DIRECTORY for a file with the
                         answer for the given "annotation_format". If such a file is
                         found, its content is returned instead of the results from
                         NER. The URI query of "DOCUMENT_URI" can also contain key
                         "aid". Unlike with key "tid", the content of the file found
                         according to this value only enriches the result of NER
                         (it is attached to it). Default value is ./testing_mode
                         relative to the script's directory.
   -p PATH, --uds_path PATH
                         Sets path to the Unix domain socket on which the daemon
                         waits for clients. Default value is ./daemon_uds
                         relative to the script's directory.
   -c CONFIG.json, --config_file CONFIG.json
                         Sets the service and its parameters from JSON file, 
                         instead of standard input. In this case just a text to 
                         be processed or nothing is expected on standard input.
   --plaintext           Output of services "annotate", "annotate_vertical"
                         and "get_raw_annotations" is a plain text. If an exception 
                         occurs, it remains in JSON.
   -f FILENAME, --filename FILENAME
                         Sets filename for service "annotate_vertical".

2.3 Script `sec_api.py`

Script sec_api.py is very similar to script sec.py and shares part of its code. Unlike sec.py, it accepts multiple requests on standard input per instance, and the answer to each request is printed to standard output. On launch it also starts an HTTP server that listens on port 8082 by default. Any HTTP client can send a SEC request for a specific service through this script via an HTTP POST request and get a response.

2.3.1 Usage

 ./sec_api.py [-h] [-t [DIRECTORY]] [-p PATH] [-n PORT]

 Optional arguments:
   -h, --help            Shows help and then terminates.
   -t [DIRECTORY], --testing_mode [DIRECTORY]
                         Switches to test mode, which allows testing of work
                         with structured annotations that NER does not know.
                         In this mode the service "annotate" looks in the URI query
                         of "DOCUMENT_URI" for the value of key "tid" and, according
                         to it, looks in directory DIRECTORY for a file with the
                         answer for the given "annotation_format". If such a file is
                         found, its content is returned instead of the results from
                         NER. The URI query of "DOCUMENT_URI" can also contain key
                         "aid". Unlike with key "tid", the content of the file found
                         according to this value only enriches the result of NER
                         (it is attached to it). Default value is ./testing_mode
                         relative to the script's directory.
   -p PATH, --uds_path PATH
                         Sets path to the Unix domain socket on which the daemon
                         waits for clients. Default value is ./daemon_uds
                         relative to the script's directory.
   -n PORT, --net_port PORT
                         Sets port, where SEC is waiting for clients. Default value
                         is 8082.

2.4 Internal communication protocol

The internal communication protocol is based on the client-server model and uses a Unix domain socket (UDS) in stream mode.

Key points include:

simplicity
ability to transfer file descriptors (both ways)
ability to transfer large amounts of data

2.4.1 Procedure

1. The server waits for clients.
2. A client connects.
3. The client sends the settings to the server (the test mode directory and the JSON with the settings of the required service).
4. The server receives the settings and sends the client a confirmation.
5. The client sends the data to be processed to the server (if the required service does not need any data, the data may be empty).
6. The server may (in test mode) ask the client to open one or more files and send their file descriptors (this also proves that the client has permission to open the files).
7. The server sends the processed data to the client.
8. The client closes the connection, or continues with step 5 or step 3.

If the server detects incorrect settings or an error occurs during processing, the client is sent the error information and the connection is terminated.

2.4.2 Commands and packet structure

The commands can be found in the file daemon_lib.py. Each command has a dynamically generated two-digit numeric Opcode.

For commands two packet structures are being used. For errors it is:

  2 bytes        String      2 bytes
  -----------------------------------
 | Opcode |  Error message  |  CRLF  |
  -----------------------------------

For the rest (except for file descriptors) the following structure is used, repeated until the number of bytes equals zero:

  2 bytes        Number (decimal)        2 bytes    N bytes    2 bytes
  ---------------------------------------------------------------------
 | Opcode |  Number of bytes of data N  |  CRLF  |  Raw data  |  CRLF  |
  ---------------------------------------------------------------------

Library python-fdsend is being used to send file descriptors.
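The two packet layouts above can be illustrated with a minimal Python sketch. It is not taken from daemon_lib.py: the function names, the ASCII encoding of the opcode and length fields, the chunk size, and the assumption that the terminating zero-length packet still carries its trailing CRLF are all assumptions made for illustration.

```python
CRLF = b"\r\n"


def encode_error(opcode: int, message: str) -> bytes:
    """Error packet: 2-byte opcode, error message, CRLF."""
    return b"%02d" % opcode + message.encode("utf-8") + CRLF


def encode_data(opcode: int, payload: bytes, chunk_size: int = 65536) -> bytes:
    """Data packets: opcode, decimal length N, CRLF, N raw bytes, CRLF.

    The payload is split into chunks; a packet with N == 0 ends the message
    (assumed here to keep the same layout with empty raw data).
    """
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        out += b"%02d%d" % (opcode, len(chunk)) + CRLF + chunk + CRLF
    out += b"%02d0" % opcode + CRLF + CRLF
    return bytes(out)


def decode_data(buf: bytes):
    """Inverse of encode_data: returns (opcode, payload)."""
    pos, parts = 0, []
    while True:
        opcode = int(buf[pos:pos + 2])
        header_end = buf.index(CRLF, pos + 2)
        length = int(buf[pos + 2:header_end])
        start = header_end + len(CRLF)
        data = buf[start:start + length]
        if buf[start + length:start + length + 2] != CRLF:
            raise ValueError("missing CRLF after raw data")
        pos = start + length + 2
        if length == 0:
            return opcode, b"".join(parts)
        parts.append(data)
```

Because the length is sent before the raw data, the payload itself may contain CRLF sequences without confusing the receiver.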

3 Multiple NERs support

In development - documentation will be completed later (contains only essential facts at the moment):

To wrap another NER, create a class that inherits from class NERTemplate and redefines method _process(), and possibly also methods _start() and _end() if something needs to be allocated/deallocated. See the comment of class NERTemplate for more details.
If a NER wrapper is added, it must be registered in the file sec_daemon.py under:

       ner_manager.appendNER("default", module_annotate.NER())

using a similar line with a different NER name and an instance of the other wrapper.

A dictionary of NERs has been created, and the requested NER is selected from it by name.
Selection of the NER when calling service annotate and a service listing the available NERs will be added later.

3.1 Output specification from NERs

The specification is based on our NER and on other requirements. The output from NERs is expected to have this syntax (BNF):

 <output from NERs> ::= <origin_base>
     | <origin_base> "\t" <id>
     | <origin_base> "\t" <id> "\t" <direct_attributes>
 <origin_base> ::= <start_offset> "\t" <end_offset> "\t" <data_type> "\t" <string_between_offsets> "\t" <data>
 <data_type> ::= "kb"
     | "activity"
     | "date"
     | "interval"
     | "coref"
     | "uri"
 <data> ::= <data-kb>
     | <data-activity>
     | <data-date>
     | <data-interval>
     | <data-coref>
     | <data-uri>
 <data-kb> ::= <KB_row> | <KB_row> ";" <data-kb>
 <data-date> ::= <year> "-" <month> "-" <day>
 <data-interval> ::= <data-date> " -- " <data-date>
 <data-coref> ::= <data-kb>
 
 <direct_attributes> ::= <attribute> | <attribute> "|" <direct_attributes>
 <attribute> ::= <attribute_name> "[" <attribute_type> "]=" <attribute_value>
 <attribute_type> ::= "string" | "decimal" | "date" | "image" | "integer" | "uri" | <other_attribute_type> 
 
 <year> ::= <digit> <digit> <digit> <digit>
 <month> ::= <digit> <digit>
 <day> ::= <digit> <digit>
 <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

where:

<start_offset> and <end_offset> represent the start and end position of the entity in the input text
<data_type> indicates the type of the entity
- "kb": <data> are numbers of lines from KB separated by ";"
- "activity": indicates that <string_between_offsets> is a verb; <data> is the infinitive of the verb <string_between_offsets>
- "date": <data> is a date in ISO 8601 format YYYY-MM-DD
- "interval": <data> are two dates in ISO 8601 format YYYY-MM-DD separated by " -- "
- "coref": indicates that <string_between_offsets> is a pronoun; <data> as with "kb"
- "uri": <data> is a URI on wikipedia, freebase, dbpedia or similar
<string_between_offsets> is the text form of the entity from the input text
<data> are data according to <data_type>, thus <data-<data_type>> (see the BNF above)
<id> is the identifier of the annotation
- unique within the output from NER
- can be anything (e.g. URI, number, empty string)
- has effect only in output to SXML
<direct_attributes> are extra attributes added by NER, separated by "|"
- in most cases URI, number or string without special characters
- cannot contain the characters '|', '\t' and '\n'
- has effect only in output to SXML
<attribute_type> is based on XSD, but it can also contain other types whose description is not known to the author (e.g. type "AnnotationLink")
- "string" - string
- "decimal" - numerical value
- "date" - date in format YYYY-MM-DD
- "image" - URI to image
- "integer" - integer
- "uri" - URI in general
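As an illustration of the grammar above, the following minimal Python sketch parses one line of NER output into a record. It is not part of SEC; the names NerRecord, Attribute, parse_ner_line and kb_rows are hypothetical, and the sample line in the usage note is made up.

```python
from typing import List, NamedTuple, Optional

DATA_TYPES = {"kb", "activity", "date", "interval", "coref", "uri"}


class Attribute(NamedTuple):
    name: str
    type: str
    value: str


class NerRecord(NamedTuple):
    start_offset: int
    end_offset: int
    data_type: str
    text: str
    data: str
    id: Optional[str]
    attributes: List[Attribute]


def parse_attribute(raw: str) -> Attribute:
    # <attribute_name> "[" <attribute_type> "]=" <attribute_value>
    head, value = raw.split("]=", 1)
    name, attr_type = head.split("[", 1)
    return Attribute(name, attr_type, value)


def parse_ner_line(line: str) -> NerRecord:
    fields = line.rstrip("\n").split("\t")
    if not 5 <= len(fields) <= 7:
        raise ValueError("expected 5 to 7 tab-separated fields")
    start, end, data_type, text, data = fields[:5]
    if data_type not in DATA_TYPES:
        raise ValueError("unknown data type: " + data_type)
    rec_id = fields[5] if len(fields) > 5 else None
    attrs = [parse_attribute(a) for a in fields[6].split("|")] if len(fields) > 6 else []
    return NerRecord(int(start), int(end), data_type, text, data, rec_id, attrs)


def kb_rows(record: NerRecord) -> List[int]:
    # For "kb" and "coref", <data> holds KB line numbers separated by ";".
    if record.data_type not in ("kb", "coref"):
        raise ValueError("record does not reference KB rows")
    return [int(row) for row in record.data.split(";")]
```

For example, parse_ner_line("0\t8\tkb\tJane Doe\t10;25\tann1\tgender[string]=female") yields a record whose kb_rows() are [10, 25] and whose single attribute is named "gender".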

4 Calling SEC as a library

To avoid having to create a subprocess with sec.py when SEC is used by another program, class Sec has been created in sec.py. Its methods are described in the source code. As with sec.py, the script sec_daemon.py must be running, and the class must be initialized with the path to the Unix domain socket on which the daemon waits. Configurations are defined similarly to sec.py; the only difference is that the Python equivalent is used instead of JSON (see table).

5 Launching on grid

The script ./sge/sec.sh has been created for launching SEC on a grid (SGE).

Several requirements were set:

Several separately running instances of SEC must be possible on one machine (each sec.py has its own sec_daemon.py, while the KB is shared).
Each machine in the grid is connected to the same disk via NFS and the machines have the same processor architecture. The binary form of KB is to be used, so it must be ensured that the binary representation of KB is not generated by multiple processes at once.

The resulting ./sge/sec.sh accepts the same arguments as sec.py. Even though it was designed for launching on a grid, it can also be used on ordinary machines (to be sure).

To this end the switch --own_kb_daemon was added to sec_daemon.py and --plaintext to sec.py. For the same purpose the KB daemon gained the ability to change the name of its shared memory via a program argument.

5.1 Usage in the manner of NER

To use SEC with stdin/stdout of NER you can use service get_raw_annotations. It is necessary to create a configuration file (for example "get_raw_annotations.cfg"), e.g. with:

 {
     "get_raw_annotations": {}
 }

Then NER can be called via SEC using this command:

 ./sge/sec.sh -c get_raw_annotations.cfg --plaintext

6 Launching on Salomon

Scripts in the directory ./salomon have been created for launching on the supercomputer Salomon (IT4I).

SEC is dependent on several libraries that are not installed on Salomon. It is necessary to copy them from knot09:/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt. It can be done e.g. this way:

 $ mkdir -p ~/mnt/ssh-knot-knot09
 $ sshfs xlogin01@knot09.fit.vutbr.cz:/ ~/mnt/ssh-knot-knot09/
 $ cp -r ~/mnt/ssh-knot-knot09/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt ~/
 $ fusermount -u ~/mnt/ssh-knot-knot09

Dependencies are already built. If a new compilation is necessary, launch ./salomon/prepare.sh.

To launch, use one of the variants of the script ./salomon/start.sh. Each variant expects:

directory ~/parsed containing the verticals to be processed
file ~/namelist containing the names of the verticals from ~/parsed without the extension \.vert.*. It can be obtained e.g. using this command:

       $ ls ~/parsed | sed 's/\.vert.*//g' > ~/namelist

directory ~/configs with configuration file mg4j.cfg

The following are created upon launch:

directory ~/logs with errors
directory ~/secsgeresult with results
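For environments where the file ~/namelist is prepared from Python rather than from the shell, the sed command above can be mirrored with a few lines of code. This is only a sketch; the function names are hypothetical, and the entries are sorted by code point, which may differ from the locale-dependent order of ls.

```python
import re
from pathlib import Path


def vertical_names(filenames):
    """Strip everything from the first '.vert' on, like sed 's/\\.vert.*//g'."""
    return [re.sub(r"\.vert.*", "", name) for name in filenames]


def write_namelist(parsed_dir, namelist_path):
    """Write one vertical name per line, derived from the files in parsed_dir."""
    names = vertical_names(sorted(p.name for p in Path(parsed_dir).iterdir()))
    Path(namelist_path).write_text("".join(name + "\n" for name in names))
```

A call such as write_namelist(Path.home() / "parsed", Path.home() / "namelist") would then produce the expected input file.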

6.1 Variants

Variant ./salomon/start.sh launches a separate instance of ./sge/sec.sh on one node for each file from ~/namelist.
Variant ./salomon/v2/start.sh requires an argument defining the number of jobs per node. Based on this number and on the number of files in ~/namelist, the necessary number of jobs is created; each job occupies all the nodes available to the user per job according to the limits.

7 Provided services

7.1 Service `annotate`

This service returns annotations for the specified document. It uses the enrichment engine chosen by the user. Annotations include information about their location in the document (start and end offset), length and the annotated text itself. They also contain information obtained from KB, including e.g. type, name and URL on wikipedia.

Enrichment engine is chosen by user in parameter "enrichment_engine". User can also assign its maximal processing time using the "enrichment_engine_timeout" parameter. To print all enrichment engines you can use the "get_enrichment_engines" service.

Text in the document is usually ambiguous and that is why enrichment engine might find multiple entities to the particular text. If parameter "disambiguate" is set, then enrichment engine will select the most probable meaning of annotated text.

Output format can be chosen using the "annotation_format" parameter. It is possible to choose multiple output formats for one input. (Note: this might be changed later so that the output is always correct JSON when "plaintext": false.) Parameter "annotation_format" can have these values:

html - Output is a HTML document containing original text enriched with the annotations found. In a browser annotated parts are underlined and after hovering over them will display a block of information from KB. For this output format "disambiguate": 0 will cause an error.
index - Output is a plain text document intended for full-text indexing. It contains original text enriched with the annotations found.
nif - Output contains only annotated text in format NIF.
rdf - Output contains only annotated text in format RDF.
sxml - Output is a XML document containing annotated text only. It is designed mainly for further processing.
text - Output is a plain text document readable by a human. It contains annotated text only.
xml - Output is a XML document containing original text enriched with the annotations found.

Using the "types_and_attributes" parameter you can specify what information from KB will be included to the output. It is possible to allow specific types and all of their attributes (syntax { str(type): "all" }) or some of them (syntax { str(type): [ str(attribute), ... ] }). Its default value is "all", which means that statement with all types of annotations and its attributes is allowed. All available types and their attributes can be printed using the "get_entity_types_and_attributes" service.
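The three accepted shapes of "types_and_attributes" can be made concrete with a small Python sketch of the filtering semantics as described above. It is an illustration only, not SEC's implementation: the function name is hypothetical, and dropping entities whose type is not listed is an assumption derived from the word "allow".

```python
def apply_types_and_attributes(entity_type, attributes, types_and_attributes="all"):
    """Return the attributes that pass the filter, or None if the type is not allowed.

    types_and_attributes is "all", or a dict mapping a type name either to "all"
    or to a list of allowed attribute names.
    """
    if types_and_attributes == "all":
        return dict(attributes)
    allowed = types_and_attributes.get(entity_type)
    if allowed is None:
        return None  # assumption: types that are not listed are filtered out
    if allowed == "all":
        return dict(attributes)
    return {name: value for name, value in attributes.items() if name in allowed}
```

For instance, the filter {"person": ["name"]} keeps only the "name" attribute of a person entity and drops entities of any other type.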

Parameter "document_uri" is used to enter the URL from which the document was taken over. If output format NIF is set, this parameter is required.

If parameter "plaintext" is set to true, the output is not wrapped in JSON. In this case the individual output formats are separated by the character '\0'.
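A client that requests several formats with "plaintext": true has to split the raw output itself. The sketch below shows one way to do that; the function name is hypothetical, and it assumes that the parts arrive in the same order as the requested formats and that at most one trailing separator may be present, neither of which is confirmed here.

```python
def split_plaintext_output(raw, annotation_formats):
    """Map each requested format to its part of a '\\0'-separated plaintext output."""
    parts = raw.split("\0")
    if len(parts) == len(annotation_formats) + 1 and parts[-1] == "":
        parts.pop()  # tolerate a trailing separator (assumption)
    if len(parts) != len(annotation_formats):
        raise ValueError("number of parts does not match the requested formats")
    return dict(zip(annotation_formats, parts))
```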

Samples and more information can be found here.

7.1.1 Request format

 {
     "annotate": {
         "input_text": str,
         "annotation_format": [ str, ... ],
         "disambiguate": int,
         "document_uri": str,
         "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
         "enrichment_engine": str,
         "enrichment_engine_timeout": int,
         "plaintext": bool
     }
 }

Optional attributes are: ["disambiguate", "types_and_attributes", "document_uri", "enrichment_engine", "enrichment_engine_timeout", "plaintext"].
Attribute "document_uri" is required if the values of attribute "annotation_format" include "nif".
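A request in the format above can be assembled and checked before it is sent, as in the following Python sketch. The helper names are hypothetical. Posting the JSON body to the root path of the server with a JSON content type is an assumption; the exact HTTP conventions are given in the protocol documentation, not here.

```python
import json
import urllib.request

OPTIONAL = {
    "disambiguate", "types_and_attributes", "document_uri",
    "enrichment_engine", "enrichment_engine_timeout", "plaintext",
}


def build_annotate_request(input_text, annotation_formats, **optional):
    """Build the JSON-serializable request for service "annotate"."""
    unknown = set(optional) - OPTIONAL
    if unknown:
        raise ValueError("unknown attributes: " + ", ".join(sorted(unknown)))
    if "nif" in annotation_formats and "document_uri" not in optional:
        raise ValueError('"document_uri" is required for format "nif"')
    params = {"input_text": input_text, "annotation_format": list(annotation_formats)}
    params.update(optional)
    return {"annotate": params}


def make_http_request(request, url="http://localhost:8082/"):
    """Wrap the request in an HTTP POST (URL path and header are assumptions)."""
    body = json.dumps(request).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
```

The prepared object can be passed to urllib.request.urlopen() against a running sec_api.py instance. The same dictionary serialized with json.dumps() can also be written to the standard input of sec.py.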

7.1.2 Answer format

 {
     "annotation": str
 }

7.1.3 Output format

Output format of this service can be chosen by parameter "annotation_format". These formats are described below.

7.1.3.1 SXML

An XML document containing annotated text only. It is designed mainly for further processing. Its structure is described by the following RELAX NG schema:

 <?xml version="1.0" encoding="UTF-8"?>
 <!-- Generated using: trang -I xml -O rng *.sxml SXML.rng -->
 <grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="suggestion">
      <zeroOrMore>
        <element name="text">
          <attribute name="e_offset">
            <data type="integer"/>
          </attribute>
          <attribute name="s_offset">
            <data type="integer"/>
          </attribute>
          <attribute name="string"/>
          <zeroOrMore>
            <element name="annotation">
              <optional>
                <attribute name="id">
                  <data type="anyURI"/>
                </attribute>
              </optional>
              <attribute name="type">
                <data type="NCName"/>
              </attribute>
              <zeroOrMore>
                <element name="attribute">
                  <optional>
                    <attribute name="annotType">
                      <data type="NCName"/>
                    </attribute>
                  </optional>
                  <attribute name="name">
                    <data type="NCName"/>
                  </attribute>
                  <attribute name="type">
                    <data type="NCName"/>
                  </attribute>
                  <text/>
                </element>
              </zeroOrMore>
            </element>
          </zeroOrMore>
        </element>
      </zeroOrMore>
    </element>
  </start>
 </grammar>

7.1.3.2 XML

7.1.3.3 HTML

7.1.3.4 Text

7.1.3.5 Index

7.1.3.6 Index2

7.1.3.7 RDF

7.1.3.8 NIF

7.2 Service `annotate_vertical`

A special variant of service "annotate" for annotating text in vertical format (for this reason only the differences are described). One of its parts, the "deverticalize" service, takes care of gradually obtaining the individual documents from the input in vertical format. The output format must be specified in the request.

7.2.1 Request format

 {
    "annotate_vertical": {
        "input_text": str,
        "annotation_format": str,
        "vert_in_cols": [ str, ... ],
        "vert_out_cols": [ str, ... ],
        "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
        "enrichment_engine": str,
        "enrichment_engine_timeout": int,
        "filename": str,
        "num_workers": int,
        "plaintext": bool,
        "max_values_per_col": int | null,
        "wiki_mode": bool,
        "enable_figa": bool
    }
 }

Optional attributes are: ["vert_in_cols", "vert_out_cols", "types_and_attributes", "enrichment_engine", "enrichment_engine_timeout", "filename", "num_workers", "plaintext"].
Attribute "annotation_format" can have values: ["mg4j", "manatee", "manatee2", "elasticsearch"].
- mg4j - Output is in the vertical format MG4J.
- manatee - Output is in the vertical format Manatee (annotations are enriched with XML tags).
- manatee2 - Output is in the vertical format Manatee2 (annotations are added to additional columns similarly to MG4J).
- elasticsearch - Output is in the format ElasticSearch (one document of vertical = one structure on output).
Attributes "vert_in_cols" and "vert_out_cols" determine the meaning of each individual column of the vertical on input and on output, respectively. If they are not set, the meaning is standard ⇒ the first 13 MG4J columns. The previously used attribute "vert_cols" is still functional, but it is deprecated. The names of the columns are listed in the header of the MG4J format.
Attribute "filename" sets name of source file. It is used in the output format MG4J.
Attribute "num_workers" sets the number of documents from the vertical that are processed in parallel. Default value is number of processes.
Attribute "max_values_per_col" sets maximum number of values per column. It is used in the output format MG4J. Default value is 4.
Attribute "wiki_mode" allows the use of column "docuri" for processing verticals from wikipedia. It works like this:
- If the annotation is a co-reference, then:
  - column "docuri" is set to "1" if it points at the entity whose URL (column "url") is exactly the URL of the website.
  - otherwise column "docuri" is set to "2".
- Otherwise:
  - column "docuri" is set to "4" if the hypertext link (column "link") is empty or exactly the same as the URL of the entity (column "url").
  - otherwise:
    - if attribute "enable_figa" == true, an entity is looked up in KB according to the URL in the hypertext link:
      - if such an entity is found, the "docuri" column is set to its type.
      - otherwise the "docuri" column is set to "4".
    - otherwise the "docuri" column is set to "3".
- Default values are "wiki_mode": false and "enable_figa": true.
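The decision tree for column "docuri" can be restated as a short Python function. This is a sketch of the rules as listed above, not SEC's code: the function and parameter names are hypothetical, "page_url" stands for the URL of the website being processed, and "kb_type_by_url" is an assumed callable that returns the type of the KB entity with the given URL or None when there is no such entity.

```python
def docuri_value(is_coref, url, link, page_url, enable_figa=True, kb_type_by_url=None):
    """Return the value of column "docuri" for one annotation in wiki mode."""
    if is_coref:
        # Co-reference: "1" when the entity URL is exactly the URL of the website.
        return "1" if url == page_url else "2"
    if not link or link == url:
        return "4"
    if not enable_figa:
        return "3"
    # Look the linked entity up in KB by the URL of the hypertext link.
    entity_type = kb_type_by_url(link) if kb_type_by_url is not None else None
    return entity_type if entity_type is not None else "4"
```

Passing the lookup as a callable keeps the sketch independent of how the KB is actually accessed.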

7.2.2 Answer format

 {
     "annotation": str | [
         {
             "title": str,
             "uri": str,
             "article": str
         },
         ...
     ]
 }

If "annotation_format" is "elasticsearch", the value of key "annotation" is a list of objects; otherwise it is a string.

7.3 Service `deverticalize`

Deverticalizes text in vertical format (see https://www.sketchengine.co.uk/documentation/preparing-corpus-text/ or http://nlp.fi.muni.cz/cs/PopisVertikalu).

7.3.1 Request format

 {
     "deverticalize": {
         "input_text": str,
         "vert_in_cols": [ str, ... ]
     }
 }

Optional attributes are: ["vert_in_cols"].
Attribute "vert_in_cols" determines the meaning of each individual column in the input vertical. If it is not set, the meaning is standard ⇒ the first 13 MG4J columns. The previously used attribute "vert_cols" is still functional, but it is deprecated.

7.3.2 Answer format

 {
     "deverticalized": [
         {
             "id": str,
             "document": str
         },
         ...
     ]
 }

7.3.3 Errors

If the input text contains three consecutive exclamation marks, each enclosed in its own <s></s> tags and accompanied by a <g/> tag, the <g/> tags will not glue them together because of the <s> tags (e.g.: "<s>\nHello\n</g>\n!</s><s></g>\n!</s><s></g>\n!</s>" ⇒ "Hello!\n!\n!\n").

7.4 Service `get_enrichment_engines`

Lists all available enrichment engines. These are the values that attribute "enrichment_engine" (used by some services) can have.

7.4.1 Request format

 {
     "get_enrichment_engines": {}
 }

7.4.2 Answer format

 {
     "enrichment_engines": [ str, ... ]
 }

7.5 Service `get_entities`

Prints every entity from KB whose name is the same as or similar to the name specified in attribute "input_string". The output is ordered by the value of the entity attribute "confidence". As in service "annotate", the output can be filtered using the "types_and_attributes" attribute.

Samples and more information can be found here.

7.5.1 Request format

 {
     "get_entities": {
         "input_string": str,
         "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
         "max_results": int
     }
 }

Optional attributes are: ["max_results"].
Attribute "input_string" sets name of the wanted entity or the initial part of the name ending with an asterisk '*'.
Attribute "types_and_attributes" filters in the exactly same way as service "annotate".
Attribute "max_results" sets the maximum number of entities in output. Default value is 10.

7.5.2 Answer format

 {
     "data": [
         {
             str(type): {
                 str(attribute): str,
                 ...
             }
         },
         ...
     ]
 }

7.6 Service `get_entity_by_uri`

Search for entities by URI.

7.6.1 Request format

 {
     "get_entity_by_uri": {
         "input_string": str,
         "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] }
     }
 }

7.6.2 Answer format

 {
     "data": [
         {
             str(type): {
                 str(attribute): str,
                 ...
             }
         },
         ...
     ]
 }

7.7 Service `get_entity_types_and_attributes`

Lists all the available types and their attributes. This information can be used at attribute "types_and_attributes" that is used as a filter for certain services.

7.7.1 Request format

 {
     "get_entity_types_and_attributes": {}
 }

7.7.2 Answer format

 {
     "data": [
         {
             "type": str(type),
             "attributes": [
                 str(attribute),
                 ...
             ]
         },
         ...
     ]
 }
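Since the answer lists exactly the names that "types_and_attributes" accepts, it can be used to validate a filter before it is sent with another request. The following Python sketch does this; the function name and the error handling are hypothetical.

```python
def validated_filter(answer, wanted):
    """Check a types_and_attributes filter against a get_entity_types_and_attributes answer.

    wanted maps a type name to "all" or to a list of attribute names.
    Returns a filter ready to be used as "types_and_attributes".
    """
    available = {item["type"]: set(item["attributes"]) for item in answer["data"]}
    result = {}
    for type_name, attrs in wanted.items():
        if type_name not in available:
            raise KeyError("unknown type: " + type_name)
        if attrs == "all":
            result[type_name] = "all"
            continue
        missing = [a for a in attrs if a not in available[type_name]]
        if missing:
            raise KeyError("unknown attributes of " + type_name + ": " + ", ".join(missing))
        result[type_name] = list(attrs)
    return result
```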

7.8 Service `get_kb_version`

Returns the version number of loaded KB.

7.8.1 Request format

 {
     "get_kb_version": {}
 }

7.8.2 Answer format

 {
     "version": int
 }

7.9 Service `get_raw_annotations`

Returns string obtained by NER.

7.9.1 Request format

 {
     "get_raw_annotations": {
         "input_text": str,
         "disambiguate": int,
         "enrichment_engine": str,
         "enrichment_engine_timeout": int,
         "plaintext": bool
     }
 }

Optional attributes are: ["disambiguate", "enrichment_engine", "enrichment_engine_timeout", "plaintext"].

7.9.2 Answer format

 {
     "annotation": str
 }

8 Some of the generated attributes for types

"confidence"

"identifier"

Not in KB.
Contains the URI obtained from attributes with flag 'i'.

"disambiguation"

 <disambiguation> ::= <text in URI between brackets> "," <description> "(" <interval of living> ")"
                    | <text in URI between brackets> "(" <interval of living> ")"
                    | <text in URI between brackets>
                    | <description> "(" <interval of living> ")"
                    | <description>
                    | <interval of living>
                    | ""

 <interval of living> ::= <YYYY-MM-DD date of birth> " -- " <YYYY-MM-DD date of death>
                   | "born " <YYYY-MM-DD date of birth>

Not in KB.
It is generated from attributes ["wikipedia_url", "date_of_birth", "date_of_death", "description"].
1. If attribute "wikipedia_url" contains parentheses (that is, any of ["(", ")", "%28", "%29"]), the text between them is taken, with underscores replaced by spaces. In the BNF above this is <text in URI between brackets>.
2. If attribute "description" contains a substring starting with "is|was a|an|the", D_DESC_MAX_WORDS of its words are taken and "..." is appended, unless the end of "description" is reached. Otherwise, if the number of characters is smaller than D_DESC_MAX_CHARS, the whole attribute "description" is taken. In the BNF above this is <description>.
3. If a date of birth or death is available, it is appended to the previous parts. In the BNF above this is <interval of living>.
4. If <description> contains <text in URI between brackets> as a whole word (matched as (?:^|\W)whole word(?=(?:$|\W))), then <disambiguation> uses the syntax without <text in URI between brackets>.
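Rules 1, 3 and 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the generator used by SEC: the function names are hypothetical, the description is expected to be already shortened according to rule 2, the spaces around "," and "(" are an assumption (the BNF does not show them), and a death date without a birth date yields no interval because the BNF defines no such form.

```python
import re
from urllib.parse import unquote


def text_in_brackets(wikipedia_url):
    """Rule 1: text between parentheses in the URL, underscores replaced by spaces."""
    match = re.search(r"\(([^()]*)\)", unquote(wikipedia_url))
    return match.group(1).replace("_", " ") if match else ""


def interval_of_living(birth, death):
    """Rule 3: '<birth> -- <death>' or 'born <birth>'."""
    if birth and death:
        return birth + " -- " + death
    if birth:
        return "born " + birth
    return ""


def contains_whole_word(haystack, word):
    """Rule 4: whole-word test using the pattern given in the text."""
    pattern = r"(?:^|\W)" + re.escape(word) + r"(?=(?:$|\W))"
    return re.search(pattern, haystack) is not None


def disambiguation(wikipedia_url="", description="", birth="", death=""):
    bracket = text_in_brackets(wikipedia_url)
    interval = interval_of_living(birth, death)
    if bracket and description and contains_whole_word(description, bracket):
        bracket = ""  # the description already says it
    head = ", ".join(part for part in (bracket, description) if part)
    if head and interval:
        return head + " (" + interval + ")"
    return head or interval
```

With the made-up entity URL "http://en.wikipedia.org/wiki/Jane_Doe_(painter)", the description "Dutch artist" and the dates 1900-01-02 and 1980-03-04, the function returns "painter, Dutch artist (1900-01-02 -- 1980-03-04)".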