
Integration testing Service Fabric & Traefik with Docker

Here is the plan:

  1. Use Docker to run a Service Fabric Linux cluster in a container
  2. Deploy a test app to the cluster and create 51 instances of it

Aim: While developing the Traefik SF integration it will provide a simple cluster to use, debug and perform integration testing.

TL;DR: Have a look at the full code in this PR

It was a fun journey but I got it working…

First hurdle: get an SF cluster image which starts quickly.

Our starting point is this guide. In it you pull the Service Fabric docker image and then execute ./setup.sh and ./run.sh, which are scripts contained in the image.

Note: You need to add the additional docker daemon config to run SF in a container. See details here.
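At the time of writing, the daemon config the guide asks for enables IPv6 for the container; the snippet below is a sketch of that change, so defer to the linked guide if it has moved on. On Linux the file lives at /etc/docker/daemon.json, and you need to restart the Docker daemon after editing it:

```json
{
    "ipv6": true,
    "fixed-cidr-v6": "fd00::/64"
}
```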

Once done you have a cluster running in the container. Unfortunately ./setup.sh takes a while to run, installing packages from apt, such as the JVM, and then setting up some environment configuration.

Fortunately, we can build our own image on top of this to make things quicker. Essentially, we'll run ./setup.sh as part of our docker build so it doesn't have to run when we start the container.

My first attempt at this had a dockerfile like so:

FROM servicefabricoss/service-fabric-onebox
WORKDIR /home/ClusterDeployer
RUN ./setup.sh
EXPOSE 19080 19000 80 443
CMD ./run.sh

Now if you build an image from this it will start and you can connect to the Explorer. However, when you try to upload to the ImageStore you'll get a 500 response.

So after some messing around I took a look at the setup.sh script and found this at the end:

/etc/init.d/ssh start
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US:en
export LC_ALL=en_US.UTF-8

These actions wouldn't be captured in the docker image we built, as they're not persisted: each RUN step executes in its own shell, so exported variables and started services don't carry over into the final image. We can update our Dockerfile to capture these like so:

FROM servicefabricoss/service-fabric-onebox
WORKDIR /home/ClusterDeployer
RUN ./setup.sh
#Generate the local
RUN locale-gen en_US.UTF-8
#Set environment variables
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US:en
ENV LC_ALL=en_US.UTF-8
EXPOSE 19080 19000 80 443
#Start SSH before running the cluster
CMD /etc/init.d/ssh start && ./run.sh

Now this will build and run as you would expect. Yippee! You can try it now with this command:

docker run --name sftestcluster -d --rm -p 19080:19080 -p 19000:19000 -p 25100-25200:25100-25200 lawrencegripper/sfonebox

Deploying into the container

So how about deploying onto this cluster we have running? Well, we could install and run the sfctl tool on our machine, but keeping tools up to date and making sure they don't clash with others is a pain, so let's use docker here again.

We can create a sfctl docker image with a Dockerfile like so:

FROM python:3
RUN pip3 install sfctl
RUN sfctl cluster select --endpoint http://localhost:19080
WORKDIR /src
ENTRYPOINT [ "bash" ]

Now for some orchestration: we need to run up the container for the cluster, wait for it to become healthy, then deploy our app into it using the sfctl container.

We can write a nice BASH function to poll the cluster's health endpoint, using jq to parse the output and pick out the aggregated health state and node count. The function looks like this:

function isClusterHealthy () {
    echo "Checking cluster status..."
    HEALTHURL="http://localhost:19080/$/GetClusterHealth?NodesHealthStateFilter=1&ApplicationsHealthStateFilter=1&EventsHealthStateFilter=1&api-version=3.0"
    # Fetch the health document once and parse it twice, rather than
    # hitting the endpoint separately for each field
    HEALTH_JSON="$(wget --timeout=1 -qO - "$HEALTHURL")"
    HEALTH_RESULT="$(echo "$HEALTH_JSON" | jq -r '.AggregatedHealthState')"
    NODE_COUNT="$(echo "$HEALTH_JSON" | jq -r '.HealthStatistics.HealthStateCountList[0].HealthStateCount.OkCount')"
    echo "Current Status $HEALTH_RESULT Nodes: $NODE_COUNT"
    if [ "$HEALTH_RESULT" = "Ok" ] && [ "$NODE_COUNT" = "3" ]; then
        # Healthy: return non-zero so a 'while isClusterHealthy' loop exits
        return 1
    else
        echo "Waiting for health with 3 Nodes..."
        return 0
    fi
}
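Note the inverted return convention: the function returns 0 (bash "success") while the cluster is still unhealthy, so a plain while loop keeps polling until it's ready. Here's a minimal sketch of that wait loop, with a stubbed health check standing in for the real wget/jq function above so the control flow is clear:

```shell
# Stub standing in for the real health check: reports unhealthy twice,
# then healthy. Illustrative only.
attempts=0
isClusterHealthy() {
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
        return 1   # healthy -> non-zero, so the while loop exits
    fi
    return 0       # still unhealthy -> keep looping
}

# The wait loop: keeps polling while isClusterHealthy returns 0 (unhealthy)
while isClusterHealthy; do
    echo "Cluster not ready yet (check $attempts), sleeping..."
    sleep 0  # the real script would sleep a few seconds here
done
echo "Cluster healthy after $attempts checks"
```

In the real orchestration script you'd start the cluster container first, then run this loop before kicking off the sfctl deployment.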

Now we can deploy into our cluster with a script like this:

#!/bin/bash
echo "######## Upload app ###########"
sfctl application upload --path ./testapp
echo "######## Provision type ###########"
sfctl application provision --application-type-build-path testapp
echo "######## Create 51 instances ###########"
for i in {100..150}
do
   ( echo "Deploying instance $i"
   sfctl application create --app-type NodeAppType --app-version 1.0.0 --parameters "{\"PORT\":\"25$i\"}" --app-name fabric:/node25$i ) &
done
echo "Waiting for deployment to complete..."
wait
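Two details worth calling out in that script: each sfctl application create runs in a background subshell (the `( ... ) &`), with wait blocking until every one finishes; and the loop index doubles as the port suffix, so {100..150} yields 51 instances on ports 25100-25150, which is why the earlier docker run published the 25100-25200 port range. A stripped-down sketch of the fan-out pattern:

```shell
# Stripped-down version of the fan-out/wait pattern used above:
# each iteration runs in a background subshell, wait blocks for all of them.
for i in {100..103}
do
    ( echo "would deploy fabric:/node25$i on port 25$i" ) &
done
wait
echo "all deployments finished"
```

The output lines can arrive in any order, since the subshells run concurrently.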

We mount this into our ‘sfctl’ container and execute it like so:

docker run --name appinstaller -it --rm --network=host -v ${PWD}:/src lawrencegripper/sfctl -f ./uploadtestapp.sh

Notice the use of --network=host: this is crucial, as without it the docker networking would prevent the sfctl container from connecting to the cluster container.

Note: At this point I saw a strange error from sfctl: "500 code returned, maximum retries exceeded". Unfortunately there was no way to get the body of the 500 response through sfctl. Never fear, Postman to the rescue: I manually created the REST request so I could inspect the error:

curl -X POST \
  'http://localhost:19080/ApplicationTypes/$/Provision?api-version=3.0&timeout=60%3FApplicationTypeImageStorePath%3Dtestapp' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
        "ApplicationTypeBuildPath": "testapp"
}'

This returned

{
    "Error": {
        "Code": "FABRIC_E_IMAGEBUILDER_VALIDATION_ERROR",
        "Message": "The EntryPoint node is not found.\nFileName: /home/ClusterDeployer/ClusterData/Data/N0010/Fabric/work/ImageBuilderProxy/AppType/NodeAppType/WebServicePkg/ServiceManifest.xml"
    }
}

It hit me that Node.js wasn't present in the container running the cluster. I installed it with apt and then created a 'node.sh' file to start the app, pointing to this from my manifest. After this I got another message about the 'code' folder not being found. This took a bit of time, but I realized that on Linux the filesystem is case sensitive and I had 'Code' not 'code'; renaming the folder fixed that.
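For reference, here's a rough sketch of what the relevant part of the ServiceManifest.xml might look like after the fix. The WebServicePkg name comes from the error above; the node.sh program and the exact attribute values are illustrative, not a copy of the real manifest:

```xml
<CodePackage Name="code" Version="1.0.0">
  <EntryPoint>
    <ExeHost>
      <Program>node.sh</Program>
      <WorkingFolder>CodePackage</WorkingFolder>
    </ExeHost>
  </EntryPoint>
</CodePackage>
```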

Making it route requests!

Well, it turns out that ./ClusterDeployer.sh doesn't configure things quite right for my use case. It uses the following to discover the cluster's IP address:

IPAddr=`ifconfig eth0 2>/dev/null|awk '/inet addr:/ {print $2}'|sed 's/addr://'`

This means our container's IP address is used, and with my container that isn't routable; we need 'localhost' to be used instead.

Here is the quick fix (HACK) added to the Dockerfile:

FROM lawrencegripper/sfonebox
# Install Node.js so the test app's entry point can run
RUN apt-get install nodejs -y
# HACK: force the cluster to publish 'localhost' rather than the container IP
RUN sed -i "s%IPAddr=.*%IPAddr=localhost%g" ClusterDeployer.sh

This worked: the cluster now publishes the addresses of our services as http://localhost:[port], which means Traefik can pick these up and route to them.

All done!
