This program extracts certain data from the BabelNet lexical ontology. Three actions are implemented: synset, sense, and neighbourhood extraction. The data processing routines are multithreaded, so they should scale well as long as the underlying storage allows it.
Running this program requires Java 8 and Maven 3, along with a working BabelNet Java API setup. The BabelNet API configuration files should be located in the working directory from which the program is run.
Given a set of word sense clusters, this action writes two files:
words.txt, with the list of synsets per cluster, and
synsets.txt, with the list of synsets containing the input words.
The paths of both output files can be specified using the -words and -synsets options, respectively.
java -jar target/babelnet-extract.jar -action clusters -clusters "clusters.txt" -words "words.txt" -synsets "synsets.txt"
The clusters.txt input file should follow the tab-separated output format of the Chinese Whispers program, cluster<TAB>size<TAB>senses, as shown below. Note that sense labels such as #3 are ignored by the parser.
0	2	word#1, word#2
1	1	word#3
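To make the input layout concrete, the following sketch writes a two-cluster sample file and prints each cluster with its sense labels removed. It only illustrates the format: the file name clusters-sample.txt is hypothetical, and the awk one-liner is an assumption about what "ignoring the sense labels" amounts to, not the parser used by the program.

```shell
# Write a small sample in the cluster<TAB>size<TAB>senses layout.
# The file name is hypothetical and used only for this illustration.
printf '0\t2\tword#1, word#2\n1\t1\tword#3\n' > clusters-sample.txt

# Print each cluster ID with its words after dropping the "#N" sense labels.
# This is an illustration of the format, not the program's own parser.
stripped=$(awk -F'\t' '{ gsub(/#[0-9]+/, "", $3); print $1 "\t" $3 }' clusters-sample.txt)
printf '%s\n' "$stripped"
```

Each output line keeps the cluster identifier and the comma-separated words, which is the information the cluster action needs from every row.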
Given a set of synsets, this action extracts the corresponding sense lemmas and their frequencies and writes them to the file senses.txt, whose path can be specified using the -senses option.
java -jar target/babelnet-extract.jar -action senses -synsets "synsets.txt" -senses "senses.txt"
The synsets.txt input file should be produced by the synset extraction action and contains a list of BabelNet synset identifiers.
Given a set of synsets, this action extracts the n-level ego network of each one and writes the tab-separated file neighbours.txt, whose path can be specified using the -neighbours option. Each neighbour is listed with its distance, prefixed with a plus sign if the neighbour is reachable through a hypernym and with a minus sign otherwise.
java -jar target/babelnet-extract.jar -action neighbours -synsets "synsets.txt" -depth 2 -neighbours "neighbours.txt"
The format of the synsets.txt input file is the same as in the sense extraction action.
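Since the sense and neighbourhood actions both read the synsets.txt file written by the clusters action, the three commands can be chained. The sketch below is a dry run under stated assumptions: the run helper and the pipeline variable are hypothetical names, the JAR path is the one used in the commands above, and the helper only prints each command. Removing the echo would execute the steps, which requires a working BabelNet setup.

```shell
# Assumed location of the packaged JAR, as used in the commands above.
JAR=target/babelnet-extract.jar

# Hypothetical dry-run helper: prints the command instead of executing it.
# Remove the "echo" to actually run each step.
run() { echo java -jar "$JAR" "$@"; }

# Step 1 writes synsets.txt; steps 2 and 3 read it as their input.
pipeline=$(
  run -action clusters -clusters clusters.txt -words words.txt -synsets synsets.txt
  run -action senses -synsets synsets.txt -senses senses.txt
  run -action neighbours -synsets synsets.txt -depth 2 -neighbours neighbours.txt
)
printf '%s\n' "$pipeline"
```

Keeping the order fixed matters only for the first step; the senses and neighbours steps are independent of each other once synsets.txt exists.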
This action writes the file synsets.txt, which contains the BabelNet synsets for the language specified using the -language option.
java -jar target/babelnet-extract.jar -action synsets -synsets "synsets.txt" -language ru
The format of the synsets.txt output file is the same as the format of the clusters.txt file in the cluster extraction action.
A couple of preliminary steps need to be completed before building this application with Maven. First, download and unpack the BabelNet-API-3.7.zip archive. Second, install two dependencies, jltutils and babelnet-api, into the local Maven repository as follows.
mvn install:install-file -Dfile=lib/jltutils-2.2.jar -DgroupId=it.uniroma1.lcl.jlt -DartifactId=jltutils -Dversion=2.2 -Dpackaging=jar
unzip -p babelnet-api-3.7.jar META-INF/maven/it.uniroma1.lcl.babelnet/babelnet-api/pom.xml | grep -vP '<(scope|systemPath)>' >babelnet-api-3.7.pom
mvn install:install-file -Dfile=babelnet-api-3.7.jar -DpomFile=babelnet-api-3.7.pom
Note that these commands should be run inside the BabelNet API directory. The instructions are also available in Russian on NLPub: https://nlpub.ru/BabelNet. Once the preliminary setup is complete, change the directory to babelnet-extract and use Maven to compile and package the application.
Versions other than BabelNet API 3.7 might also work; it is sufficient to change the BabelNet version value in
This image is designed on the assumption that the BabelNet offline index is mounted as a volume at /babelnet/index.
If SELinux is enabled, update the security context of the BabelNet index: chcon -Rt svirt_sandbox_file_t BABELNET_PATH.
docker run --rm -it -v './BabelNet-3.7:/babelnet/index' -v './output:/babelnet/output' nlpub/babelnet babelnet-extract