infrahelpers/bom4v
BOM for Verticals (BOM-4-V)
Business-related Object Models (BOM) are software classes (incarnated here as Scala case classes) modelling the business of a particular industry (here, an example is given for the telecom industry, but it can easily be generalized to other industries). For instance, there are software classes representing customers, customer accounts, interactions (e.g., calls, messages) that customers exchange among themselves, markets, and so on.
The fields of those software classes, in addition to their respective associated methods (e.g., classical data processing functions such as aggregation and filtering), make up a high-level API (Application Programming Interface). Software developers, data engineers and data scientists can then reason at the business level when implementing more complex data processing techniques. For instance, in order to calculate the distribution of the call duration between any two customers of a given market, a developer just has to invoke the corresponding method on the Scala case class representing a customer.
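As an illustration only (the actual BOM4V models live in the ti-models-* components and may differ), the kind of case class and business-oriented method described above could look like the following sketch, with hypothetical class and field names:

// Hypothetical, simplified sketch of BOM-style case classes; the real
// BOM4V models (e.g., in ti-models-calls / ti-models-customers) may differ.
case class Call(callerId: String, calleeId: String, durationSeconds: Double)

case class Customer(customerId: String, market: String, calls: List[Call]) {
  // High-level, business-oriented method: durations of the calls made
  // by this customer towards a given other customer
  def callDurationsWith(other: Customer): List[Double] =
    calls.filter(_.calleeId == other.customerId).map(_.durationSeconds)
}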
For each business-related software class, there are associated serializers and de-serializers, allowing one, for instance, to parse raw data files (e.g., CSV files resulting from the conversion of NRT ASN.1 binary format) and to fill in the fields of those classes. The output CSV files can then be used in several, non exclusive, ways.
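In that spirit, a serializer/de-serializer pair may be sketched as follows, reusing the hypothetical Call case class from the sketch above (the actual CSV schemas are defined by the ti-serializers-* components and may differ):

// Hypothetical de-serializer sketch: parse one CSV record into a Call
// instance, and dump it back; the real column layout may differ.
object CallSerializer {
  // De-serialization: CSV record -> Call
  def fromCsv(line: String): Option[Call] =
    line.split(",", -1) match {
      case Array(caller, callee, duration) =>
        scala.util.Try(Call(caller, callee, duration.toDouble)).toOption
      case _ => None
    }

  // Serialization: Call -> CSV record
  def toCsv(call: Call): String =
    s"${call.callerId},${call.calleeId},${call.durationSeconds}"
}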
Java artefacts (JAR files) are produced (built) for each component, and can then be published in artefact repositories (e.g., Maven, Nexus), for subsequent delivery onto Spark-based production systems, either on-premises or on clouds (e.g., Databricks, GCP, AWS or Azure). The release versions of the BOM4V artefacts are stored on the so-called Maven Central repository, while the snapshot versions are stored on the OSS Sonatype repository.
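In sbt terms, such a split between release and snapshot repositories is typically configured along the following lines. This is a sketch only; the actual build definitions of the BOM4V components may differ:

// Sketch of an sbt publishing setup distinguishing snapshot from release
// repositories; the actual build.sbt files of the BOM4V components may differ.
ThisBuild / publishTo := {
  if (isSnapshot.value) Some(Opts.resolver.sonatypeSnapshots) // OSS Sonatype (snapshots)
  else Some(Opts.resolver.sonatypeStaging)                    // staging, synced to Maven Central
}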
The metamodels project itself is an umbrella, allowing one to drive all the other components from a central local directory, namely workspace/src. One can then interact with any specific component directly by jumping (cd-ing) into the corresponding directory. Software code can be edited and committed directly from that component sub-directory.
Docker images, hosted on Docker Cloud, are provided for convenience, avoiding the need to set up a proper development environment: they provide ready-to-use, ready-to-develop, ready-to-contribute environments on top of a few well-known Linux distributions (e.g., CentOS 8, Debian 10 (Buster) and Ubuntu 20.04 (Focal Fossa)). Enjoy!
In the following, <linux-distrib> is centos, debian or ubuntu:
$ docker run --rm -it infrahelpers/bom4v:<linux-distrib> bash
[build@c..5 bom4v]$ cd workspace/src/ti-spark-examples
[build@c..5 ti-spark-examples (master)]$ ./fillLocalDataDir.sh
[build@c..5 ti-spark-examples (master)]$ sbt "runMain org.bom4v.ti.Demonstrator"
[info] ...
root
|-- specificationVersionNumber: integer (nullable = true)
...
|-- servingNetwork: string (nullable = true)
+-----+-----+
|churn|count|
+-----+-----+
|False| 2278|
| True| 388|
+-----+-----+
+-----+-----+
|churn|count|
+-----+-----+
|False| 379|
| True| 388|
+-----+-----+
area under the precision-recall curve: 0.9747578698231796
area under the receiver operating characteristic (ROC) curve : 0.8484817813765183
counttotal : 667
correct : 574
wrong: 93
ratio wrong: 0.13943028485757122
ratio correct: 0.8605697151424287
ratio true positive : 0.1184407796101949
ratio false positive : 0.0239880059970015
ratio true negative : 0.7421289355322339
ratio false negative : 0.11544227886056972
[success] Total time: 63 s, completed Dec 19, 2018 4:03:30 PM
[build@c..5 ti-spark-examples (master)]$ exit
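For reference, the churn counts and evaluation metrics displayed above come from standard Spark operations. The following sketch is not the actual org.bom4v.ti.Demonstrator code, and the input file name is illustrative:

import org.apache.spark.sql.SparkSession

// Sketch only: the real Demonstrator code may differ; the file name is illustrative.
val spark = SparkSession.builder.appName("ChurnSketch").getOrCreate()
val churnData = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("data/churn/churn-sample.csv")

// Distribution of the churn label, as in the count tables above
churnData.groupBy("churn").count().show()

// The PR and ROC areas printed above correspond to metrics such as
// org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderPR
// and .areaUnderROC, computed on (score, label) pairs from a fitted classifier.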
The Docker images come with all the dependencies already installed. If there is a need, however, for some more customization (for instance, other software products such as Kafka or ElasticSearch), this section describes how to get the end-to-end Spark-based churn prediction example up and running on a native environment (as opposed to within a Docker container).
An alternative is to develop your own Docker image from the one provided by that project. You would typically start the Dockerfile with FROM bom4v/sparkml:<linux-distrib>, where <linux-distrib> is centos, debian or ubuntu.
Java and Scala are needed in order to build and run the various components of that project. Moreover, the execution engine currently relies on Spark.
The maintained Docker images for that project come with all the necessary pieces of software. They can either be used as is, or used as inspiration for ad hoc setup on other configurations.
$ sudo rpm --import http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-Official
$ sudo dnf -y install 'dnf-command(config-manager)'
$ sudo dnf config-manager --set-enabled powertools
$ sudo dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
$ sudo dnf -y install less htop net-tools which sudo man vim \
git-all wget curl file bash-completion keyutils \
gzip tar unzip maven rake rubygem-rake
$ export SBT_VERSION="1.5.5"
$ sudo dnf -y install https://repo.scala-sbt.org/scalasbt/rpm/sbt-$SBT_VERSION.rpm
$ sudo dnf -y install java-11-openjdk-devel
$ export SCALA_VERSION="2.12.14" && \
  sudo mkdir -p /opt/scala && \
  wget http://www.scala-lang.org/files/archive/scala-$SCALA_VERSION.tgz && \
  tar xvf scala-$SCALA_VERSION.tgz && \
  sudo mv scala-$SCALA_VERSION /opt/scala && \
  sudo rm -f /opt/scala/latest && \
  sudo ln -s /opt/scala/scala-$SCALA_VERSION /opt/scala/latest && \
  rm -f scala-$SCALA_VERSION.tgz
$ cat >> ~/.bashrc << _EOF
# Scala
export PATH=\${PATH}:/opt/scala/latest/bin
_EOF
$ . ~/.bashrc
$ brew tap adoptopenjdk/openjdk
$ brew cask install adoptopenjdk11
$ brew install maven sbt scala apache-spark
$ scala -version
Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
$ sbt about
[info] Loading project definition from ~/project
[info] This is sbt 1.4.9
[info] The current project is built against Scala 2.12.14
The following operation needs to be done only on a native environment (as opposed to within a Docker container). The Docker image indeed comes with that Git repository already cloned and built.
$ mkdir -p ~/dev/bom4v && cd ~/dev/bom4v
$ git clone https://github.com/bom4v/metamodels.git
$ cd metamodels
$ cp docker/centos/resources/metamodels.yaml.sample metamodels.yaml
$ ln -s docker/centos/resources/Rakefile Rakefile
$ rake clone
$ rake checkout
$ rake offline=true info
$ rake
That operation may be done either from within the Docker container, or in a native environment (on which the dependencies have been installed).
As a reminder, to enter into the container, just type docker run --rm -it infrahelpers/bom4v:<linux-distrib> bash (and exit to leave it).
The following sequence of commands describes how to build, test and deliver the artefacts of all the components, so that Spark can execute the full project:
$ cd ~/dev/bom4v/metamodels
$ rm -rf ~/.m2 ~/.ivy2
$ rake offline=true deliver
$ rake offline=true test
Those operations may be done either from within the Docker container, or in a native environment (on which the dependencies have been installed).
As a reminder, to enter into the container, just type docker run --rm -it infrahelpers/bom4v:<linux-distrib> bash, where <linux-distrib> is centos, debian or ubuntu (and exit to leave it).
$ cd ~/dev/bom4v/metamodels
$ cd workspace/src/ti-models-calls
$ vi src/main/scala/org/bom4v/ti/models/calls/CallsModel.scala
$ git add src/main/scala/org/bom4v/ti/models/calls/CallsModel.scala
$ sbt +compile +test
$ sbt 'set isSnapshot := true' +publishLocal +publishM2
$ cd -
$ cd workspace/src/ti-spark-examples
$ vi src/main/scala/org/bom4v/ti/Demonstrator.scala
$ git add src/main/scala/org/bom4v/ti/Demonstrator.scala
$ sbt +compile +test
$ cd -
$ rake offline=true test
$ # If all goes well at the integration level
$ cd workspace/src/ti-models-calls
$ git commit -m "[Dev] Fixed issue #76: wrong field type for the call number"
$ cd -
$ cd workspace/src/ti-spark-examples
$ git commit -m "[Dev] Adapted to the new ti-models-calls structure"
$ cd -
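For illustration only, the kind of change referred to in the commit message above (a wrong field type for the call number) could look as follows in CallsModel.scala; the class and field names below are hypothetical:

// Hypothetical excerpt of a calls model: a call number is better modelled
// as a String (leading zeros, '+' international prefix) than as a number.
case class CallEvent(
  callNumber: String,      // was, say, callNumber: Long
  callDate: String,
  durationSeconds: Double
)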
If the Docker images need to be re-built, the following commands explain how to do it (<linux-distrib> is one among coretto, centos):
$ mkdir -p ~/dev/bom4v && cd ~/dev/bom4v
$ git clone https://github.com/bom4v/metamodels.git
$ cd metamodels
$ docker build -t infrahelpers/bom4v:<linux-distrib> docker/<linux-distrib>/
$ docker push infrahelpers/bom4v:coretto
$ docker images | grep "^infrahelpers/bom4v"
REPOSITORY TAG IMAGE ID CREATED SIZE
infrahelpers/bom4v coretto 9a33eee22a3d About an hour ago 2.16GB
So far, we have seen how to launch the application on the Spark engine embedded in the JVM spawned by SBT. That embedded Spark engine has some limitations, and a vanilla Spark installation may be preferred for more demanding use cases.
On recent Spark installations, there is no need to prefix file-paths with hdfs:// or to specify absolute file-paths (e.g., /user/$USER) on HDFS. In the following sections, details are given on how to interact with HDFS (for instance, to transfer files back and forth between the local filesystem and HDFS), but most of those operations are now optional on a local Spark installation.
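Concretely, in the Spark/Scala code, a data file can be referred to either with a plain relative path or with a fully qualified hdfs:// URI. The following sketch uses illustrative file names and an illustrative HDFS URL, not necessarily those of the Demonstrator:

import org.apache.spark.sql.SparkSession

// Illustrative only: the file names and the HDFS URL are examples.
val spark = SparkSession.builder.appName("PathSketch").getOrCreate()

// Plain relative path, resolved against the default filesystem
val dfDefault = spark.read.option("header", "true").csv("data/cdr/CDR-sample.csv")

// Fully qualified HDFS URI, explicit whatever fs.defaultFS is set to
val dfHdfs = spark.read.option("header", "true")
  .csv("hdfs://127.0.0.1:9000/user/<user>/data/cdr/CDR-sample.csv")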
$ export HDFS_URL="hdfs://127.0.0.1:9000"
$ alias hdfsfs='hdfs dfs -Dfs.defaultFS=$HDFS_URL'
$ export HDFS_USR_DIR="/user/<user>"
$ hdfsfs -mkdir -p $HDFS_USR_DIR/data/cdr
$ hdfsfs -put data/cdr/CDR-sample.csv $HDFS_USR_DIR/data/cdr
$ hdfsfs -cat $HDFS_USR_DIR/data/cdr/CDR-sample.csv|head -3
$ export MVN_CHD_REPO="$HOME/.m2/repository"
$ $SPARK_HOME/bin/spark-submit \
--class org.bom4v.ti.Demonstrator \
--master local --deploy-mode client \
--jars \
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-calls_2.12/0.0.1/ti-models-calls_2.12-0.0.1.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-calls_2.12/0.0.1-spark2.3/ti-serializers-calls_2.12-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-customers_2.12/0.0.1-spark2.3/ti-serializers-customers_2.12-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-customers_2.12/0.0.1/ti-models-customers_2.12-0.0.1.jar \
target/scala-2.12/ti-spark-examples_2.12-0.0.1-spark2.3.jar
$ $SPARK_HOME/bin/spark-submit \
--class org.bom4v.ti.Demonstrator \
--master yarn --deploy-mode client \
--jars \
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-calls_2.12/0.0.1/ti-models-calls_2.12-0.0.1.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-calls_2.12/0.0.1-spark2.3/ti-serializers-calls_2.12-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-customers_2.12/0.0.1-spark2.3/ti-serializers-customers_2.12-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-customers_2.12/0.0.1/ti-models-customers_2.12-0.0.1.jar \
target/scala-2.12/ti-spark-examples_2.12-0.0.1-spark2.3.jar
If the jobs are to be launched from a remote machine, you may want to map the local HDFS port to the HDFS port of the remote machine. For instance, from an independent terminal window on the local machine:
$ # The -N option tells SSH not to launch any remote command (e.g., bash)
$ ssh <user>@<remote-machine> -N -L 9000:127.0.0.1:9000
Then, the following commands will work:
$ export HDFS_URL="hdfs://127.0.0.1:9000"
$ alias hdfsfs='hdfs dfs -Dfs.defaultFS=${HDFS_URL}'
$ export ATF_USR_DIR="/user/<user>/artefacts"
$ export ATF_USR_URL="${HDFS_URL}${ATF_USR_DIR}"
$ hdfsfs -mkdir -p $ATF_USR_DIR
$ hdfsfs -put -f target/scala-2.12/ti-spark-examples_2.12-0.0.1-spark2.3.jar $ATF_USR_DIR
$ $SPARK_HOME/bin/spark-submit \
--class org.bom4v.ti.Demonstrator \
--master yarn --deploy-mode cluster \
--jars \
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-calls_2.12/0.0.1/ti-models-calls_2.12-0.0.1.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-calls_2.12/0.0.1-spark2.3/ti-serializers-calls_2.12-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-customers_2.12/0.0.1-spark2.3/ti-serializers-customers_2.12-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-customers_2.12/0.0.1/ti-models-customers_2.12-0.0.1.jar \
target/scala-2.12/ti-spark-examples_2.12-0.0.1-spark2.3.jar