Monitoring Juniper RPM data with InfluxDB and Grafana

I’ve been tinkering with this for some time now, more as a personal interest project to present visual data, but it morphed a bit into an automation project as well which was interesting.

When running a site-to-site IPSec VPN, it’s critically important to measure the performance of each of the connections available at the site. Typically, as a bare minimum these days, one might expect a primary broadband cable/DSL line with a backup dial/3G/4G connection.

Now, this is really the data-gathering portion (a subset) of what SD-WAN is driving towards: measure your available links, and select the best link based on a set of pre-defined criteria. These criteria may vary by deployment based on the requirements of the underlying applications in use.

With regards to available tools we can use for measurement on the Juniper SRX (today), let’s look at implementing RPM at each branch site. And before you get all giddy and wonder if you can perhaps configure a bunch of monitors at your headend devices hitting all your remote branches, let it be known you are limited to only 500 probes on a single device. This appears to be the case even on high-end SRX. There is also a great degree of concern as to the accuracy of tests performed on the SRX as it’s a CPU-bound device, meaning all the packet forwarding happens in the CPU. We will typically see a dedicated core for control-plane functions and a separate core for data-plane functions, but there is valid concern that a suddenly busy control plane (say, converging BGP and whatnot) might become busy enough to respond lazily to aggressive polling, skewing your results.

First things first. Let’s implement a couple of very simple RPM probes on our branch SRX.

I’m going to attempt to automate most of the delivery of this demo using Vagrant, vSRX, and an installation of Ubuntu such that a simple “vagrant up” will stand up much of this environment for you. For free. 😀

Most of the details of this setup are located here.

https://github.com/barnesry/Junos_RPM

But I’ll summarize the general steps required to make most of this work. Keep in mind these are specific to launching on a (read: my) MacBook, so your results may vary on other platforms.

Requirements

  1. Vagrant 1.8.1 (INSTALL)
  2. Ansible 2.0.1.0 (INSTALL)
  3. Juniper / Ansible Plugin (basically a PyEz port for Ansible)
    barnesry@barnesry-ubuntu16:~/Ansible$ sudo ansible-galaxy install Juniper.junos
    - downloading role 'junos', owned by Juniper
    - downloading role from https://github.com/Juniper/ansible-junos-stdlib/archive/1.3.1.tar.gz
    - extracting Juniper.junos to /etc/ansible/roles/Juniper.junos
    - Juniper.junos was installed successfully
  4. VirtualBox 5.0.20 (INSTALL)
  5. Git (either command line or desktop to clone into)

We’ll use VirtualBox to spin up our Ubuntu VM and our vSRX VM on our host machine, and Vagrant to handle the automation. Vagrant is really a nice CLI wrapper around VBoxManage, so you can configure all your settings in a file and launch systems rather than having to click around the UI each time.

Vagrant supports a concept called a provisioner, whereby once a VM has been launched we can kick off an Ansible playbook and push a configuration generated dynamically at launch. We’ll also need the Juniper/Ansible plugin to allow Ansible to call specific functions against our Juniper equipment. This is really just a repackaging of the more commonly used Python PyEZ library, which interacts with the device API via XML RPC. You can install PyEZ itself for Python using the following command:

pip install junos-eznc

Launch

  1. Find a directory in which to clone the github repository and change into that directory.
  2. git clone https://github.com/barnesry/Junos_RPM.git
  3. vagrant up

This should, and I’ll use the word should carefully and even italicize it for extra effect, download a copy of Ubuntu via the config.vm.box = "ubuntu/trusty64" line in the Vagrantfile and attempt to install and launch both Grafana and InfluxDB with default settings.

It will also attempt to download a copy of vSRX already configured in packet mode (i.e. not in firewall flow mode) from Vagrant Atlas via the vsrx.vm.box = 'juniper/ffp-12.1X47-D15.4-packetmode' line in the Vagrantfile. I cannot take credit for the Atlas package – I’m just using it.

Once launched, the vSRX should come up with ge-0/0/0 bridged to an external public NIC (mine is my WiFi card), and a ge-0/0/1 configured to connect to a virtual local network which is typically the host-only adapter called vboxnet0, which it happens to share with our newly launched Ubuntu server as well. FYI. The default login for this vSRX “box” is user: root pass: Juniper.

In addition, on launch you should also be asked which NIC you’d like to bridge. This will provide the internet connectivity for your VM to connect out and pull down its required packages, so choose one that provides your local machine the interwebs.

barnesry-mbp:Junos_RPM barnesry$ vagrant up
Bringing machine 'ubuntu-monitoring' up with 'virtualbox' provider...
Bringing machine 'vsrx' up with 'virtualbox' provider...
==> ubuntu-monitoring: Importing base box 'ubuntu/trusty64'...
==> ubuntu-monitoring: Matching MAC address for NAT networking...
==> ubuntu-monitoring: Checking if box 'ubuntu/trusty64' is up to date...
==> ubuntu-monitoring: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> ubuntu-monitoring: have version '20160512.0.0'. The latest is version '20160708.1.2'. Run
==> ubuntu-monitoring: `vagrant box update` to update.
==> ubuntu-monitoring: Setting the name of the VM: Junos_RPM_ubuntu-monitoring_1468616834532_36360
==> ubuntu-monitoring: Clearing any previously set forwarded ports...
==> ubuntu-monitoring: Clearing any previously set network interfaces...
==> ubuntu-monitoring: Available bridged network interfaces:
1) en0: Wi-Fi (AirPort)
2) en1: Thunderbolt 1
3) en2: Thunderbolt 2
4) p2p0
5) awdl0
6) bridge0
7) vmnet1
8) vmnet4
9) vmnet5
10) vmnet7
11) vmnet8
12) en7: USB Ethernet
==> ubuntu-monitoring: When choosing an interface, it is usually the one that is
==> ubuntu-monitoring: being used to connect to the internet.
 ubuntu-monitoring: Which interface should the network bridge to? 1
==> ubuntu-monitoring: Preparing network interfaces based on configuration...
 ubuntu-monitoring: Adapter 1: nat
 ubuntu-monitoring: Adapter 2: intnet
 ubuntu-monitoring: Adapter 3: bridged
==> ubuntu-monitoring: Forwarding ports...
 ubuntu-monitoring: 3000 (guest) => 3000 (host) (adapter 1)
 ubuntu-monitoring: 8080 (guest) => 8080 (host) (adapter 1)
 ubuntu-monitoring: 8083 (guest) => 8083 (host) (adapter 1)
 ubuntu-monitoring: 8086 (guest) => 8086 (host) (adapter 1)
 ubuntu-monitoring: 22 (guest) => 2222 (host) (adapter 1)
==> ubuntu-monitoring: Running 'pre-boot' VM customizations...
==> ubuntu-monitoring: Booting VM...
==> ubuntu-monitoring: Waiting for machine to boot. This may take a few minutes...
 ubuntu-monitoring: SSH address: 127.0.0.1:2222
 ubuntu-monitoring: SSH username: vagrant
 ubuntu-monitoring: SSH auth method: private key
 ubuntu-monitoring: Warning: Remote connection disconnect. Retrying...
 ubuntu-monitoring: Warning: Remote connection disconnect. Retrying...
==> ubuntu-monitoring: Machine booted and ready!
==> ubuntu-monitoring: Running provisioner: shell...
 ubuntu-monitoring: Running: inline script
==> ubuntu-monitoring: stdin: is not a tty
==> ubuntu-monitoring: Get:1 http://security.ubuntu.com trusty-security InRelease [65.9 kB]
==> ubuntu-monitoring: Ign http://archive.ubuntu.com trusty InRelease
==> ubuntu-monitoring: Get:2 http://archive.ubuntu.com trusty-updates InRelease [65.9 kB]
==> ubuntu-monitoring: Get:3 http://security.ubuntu.com trusty-security/main Sources [118 kB]
==> ubuntu-monitoring: Get:4 http://archive.ubuntu.com trusty-backports InRelease [65.9 kB]

<...snip...all the install stuff happens here ...snip...>

==> ubuntu-monitoring: * Starting Grafana Server
==> ubuntu-monitoring: ...done.
==> ubuntu-monitoring: Processing triggers for ureadahead (0.100.0-16) ...
==> ubuntu-monitoring: Starting the process influxdb [ OK ]
==> ubuntu-monitoring: influxdb process was started [ OK ]
==> ubuntu-monitoring: * Starting Grafana Server
==> ubuntu-monitoring: * Already running.
==> ubuntu-monitoring: ...done.

Next, once the above process has completed, Vagrant will also launch the vSRX followed by the Ansible provisioner, which will merge the template config, push this configuration to the newly launched vSRX device and start RPM polling to various points on the internet. RPM will store those results into its 1024 position rolling memory table. We’ll poll this table using our Python script on our Ubuntu server in a second.

==> vsrx: Importing base box 'juniper/ffp-12.1X47-D15.4-packetmode'...
==> vsrx: Matching MAC address for NAT networking...
==> vsrx: Checking if box 'juniper/ffp-12.1X47-D15.4-packetmode' is up to date...
==> vsrx: Setting the name of the VM: vagrant-vsrx
==> vsrx: Fixed port collision for 22 => 2222. Now on port 2200.
==> vsrx: Clearing any previously set network interfaces...
==> vsrx: Preparing network interfaces based on configuration...
 vsrx: Adapter 1: nat
 vsrx: Adapter 2: intnet
==> vsrx: Forwarding ports...
 vsrx: 22 (guest) => 2200 (host) (adapter 1)
==> vsrx: Running 'pre-boot' VM customizations...
==> vsrx: Booting VM...
==> vsrx: Waiting for machine to boot. This may take a few minutes...
 vsrx: SSH address: 127.0.0.1:2200
 vsrx: SSH username: vagrant
 vsrx: SSH auth method: private key
 vsrx: Warning: Remote connection disconnect. Retrying...
 vsrx: Warning: Remote connection disconnect. Retrying...
==> vsrx: Machine booted and ready!
==> vsrx: Setting hostname...
==> vsrx: Running provisioner: ansible...
 vsrx: Running ansible-playbook...
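
To give a feel for what the provisioner ends up pushing, here’s a rough Python sketch that expands a few parameters into Junos set commands for one RPM probe. This is not the template from the repository (Ansible handles that step); the function name rpm_set_commands and its parameters are made up for illustration, and the loss thresholds are simply hardcoded to the values this demo uses.

```python
def rpm_set_commands(owner, test, probe_type, target,
                     probe_count=5, interval=60, history_size=10):
    """Return Junos 'set' lines for one RPM probe (illustrative helper only)."""
    prefix = "set services rpm probe {} test {}".format(owner, test)
    settings = [
        "probe-type {}".format(probe_type),
        "target {}".format(target),
        "probe-count {}".format(probe_count),
        "probe-interval {}".format(interval),
        "test-interval {}".format(interval),
        "history-size {}".format(history_size),
        # Thresholds and traps are fixed here to keep the sketch short.
        "thresholds successive-loss 2",
        "thresholds total-loss 3",
        "traps test-failure",
    ]
    return ["{} {}".format(prefix, setting) for setting in settings]


# Example: an ICMP probe towards a public resolver.
for command in rpm_set_commands("dns-rpm", "ping", "icmp-ping", "address 8.8.8.8"):
    print(command)
```

Adding another probe is then just another call with a different owner, test name and target.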

If your Ansible fails to connect (I was getting this) complaining about the following, try adding the vagrant insecure key to your ssh-agent

TASK [Deploy config to device ... please wait] *********************************
task path: /Users/barnesry/PycharmProjects/Junos_RPM/provisioning/playbook-deploy-config.yaml:15
ESTABLISH LOCAL CONNECTION FOR USER: barnesry
127.0.0.1 EXEC /bin/sh -c '( umask 22 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311 `" && echo "` echo $HOME/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311 `" )'
127.0.0.1 PUT /var/folders/p3/7cq2wpk943d331zlsr808k2c001364/T/tmpo7LD9L TO /Users/barnesry/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311/junos_install_config
127.0.0.1 EXEC /bin/sh -c 'LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/env python /Users/barnesry/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311/junos_install_config; rm -rf "/Users/barnesry/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311/" > /dev/null 2>&1'
fatal: [vsrx]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_args": {"comment": null, "confirm": null, "console": null, "diffs_file": null, "file": "/tmp/vsrx.conf", "host": "127.0.0.1", "logfile": "/tmp/changes.log", "overwrite": "yes", "passwd": null, "port": 2200, "replace": "no", "savedir": null, "timeout": 0, "user": "vagrant"}, "module_name": "junos_install_config"}, "msg": "unable to connect to 127.0.0.1: ConnectAuthError(127.0.0.1)"}

When I looked in the automatically generated Ansible inventory file, all the parameters looked OK. Namely, I’m seeing ansible_ssh_user passed through as 'vagrant' as defined in the Vagrantfile vsrx.ssh.username = 'vagrant' line; otherwise it would attempt to log into the device using your current username, which you probably don’t want. I further confirmed it was connecting correctly to localhost:2200, but for some reason it was having issues passing the insecure private key found in the .vagrant.d directory.

barnesry-mbp:Junos_RPM barnesry$ cat /Users/barnesry/PycharmProjects/Junos_RPM/.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory
# Generated by Vagrant

ubuntu-monitoring ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_user='vagrant' ansible_ssh_private_key_file='/Users/barnesry/.vagrant.d/insecure_private_key'
vsrx ansible_ssh_host=127.0.0.1 ansible_ssh_port=2200 ansible_ssh_user='vagrant' ansible_ssh_private_key_file='/Users/barnesry/.vagrant.d/insecure_private_key'

[vsrx]

[all:children]
vsrx

Once I loaded this key into my ssh-agent I was able to successfully provision the vSRX. I’ll dig into why this is when I get some more time to troubleshoot.

ssh-add /Users/barnesry/.vagrant.d/insecure_private_key

Now we can try and ssh into our vSRX. Success!

barnesry-mbp:Junos_RPM barnesry$ vagrant ssh vsrx
--- JUNOS 12.1X47-D15.4 built 2014-11-12 02:13:59 UTC
vagrant@vsrx>

You can force reprovisioning of your vSRX (if it failed) by issuing

vagrant provision vsrx

Configure

Hopefully, this last try worked for you. Now let’s confirm our settings have been applied correctly.

barnesry-mbp:Junos_RPM barnesry$ vagrant ssh vsrx
--- JUNOS 12.1X47-D15.4 built 2014-11-12 02:13:59 UTC

vagrant@vsrx> show interfaces terse | match inet
ge-0/0/0.0 up up inet 10.0.2.15/24
sp-0/0/0.0 up up inet
 inet6
sp-0/0/0.16383 up up inet 10.0.0.1 --> 10.0.0.16
ge-0/0/1.0 up up inet 192.168.56.107/24  <-- here's our local mgmt
lo0.0 up up inet 192.168.0.1 --> 0/0
lo0.16384 up up inet 127.0.0.1 --> 0/0
lo0.16385 up up inet 10.0.0.1 --> 0/0

vagrant@vsrx> ping 192.168.56.199  <-- can we ping our server?
PING 192.168.56.199 (192.168.56.199): 56 data bytes
64 bytes from 192.168.56.199: icmp_seq=0 ttl=64 time=19.662 ms
^C
--- 192.168.56.199 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 19.662/19.662/19.662/0.000 ms


vagrant@vsrx> show configuration services | display set
set services rpm probe test-owner-rpm test http probe-type http-get
set services rpm probe test-owner-rpm test http target url http://www.google.com
set services rpm probe test-owner-rpm test http probe-count 5
set services rpm probe test-owner-rpm test http probe-interval 60
set services rpm probe test-owner-rpm test http test-interval 60
set services rpm probe test-owner-rpm test http history-size 10
set services rpm probe test-owner-rpm test http thresholds successive-loss 2
set services rpm probe test-owner-rpm test http thresholds total-loss 3
set services rpm probe test-owner-rpm test http traps test-failure
set services rpm probe dns-rpm test ping probe-type icmp-ping
set services rpm probe dns-rpm test ping target address 8.8.8.8
set services rpm probe dns-rpm test ping probe-count 5
set services rpm probe dns-rpm test ping probe-interval 60
set services rpm probe dns-rpm test ping test-interval 60
set services rpm probe dns-rpm test ping history-size 10
set services rpm probe dns-rpm test ping thresholds successive-loss 2
set services rpm probe dns-rpm test ping thresholds total-loss 3
set services rpm probe dns-rpm test ping traps test-failure

vagrant@vsrx> ping 8.8.8.8   <-- can we ping the interwebs?
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=63 time=111.209 ms
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 111.209/111.209/111.209/0.000 ms

vagrant@vsrx> show services rpm history-results <-- let's check some results
 Owner, Test Probe received Round trip time
 dns-rpm, ping Fri Jul 15 15:51:01 2016 100218 usec
 dns-rpm, ping Fri Jul 15 15:52:01 2016 99053 usec
 dns-rpm, ping Fri Jul 15 15:53:01 2016 101603 usec
 dns-rpm, ping Fri Jul 15 15:54:01 2016 110142 usec
 dns-rpm, ping Fri Jul 15 15:55:01 2016 101889 usec
 dns-rpm, ping Fri Jul 15 15:56:02 2016 100131 usec
 dns-rpm, ping Fri Jul 15 15:57:02 2016 110226 usec
 dns-rpm, ping Fri Jul 15 15:58:02 2016 100236 usec
 dns-rpm, ping Fri Jul 15 15:59:02 2016 100083 usec
 dns-rpm, ping Fri Jul 15 16:00:02 2016 101544 usec
 test-owner-rpm, http Fri Jul 15 15:49:01 2016 250198 usec
 test-owner-rpm, http Fri Jul 15 15:50:01 2016 268652 usec
 test-owner-rpm, http Fri Jul 15 15:52:01 2016 269524 usec
 test-owner-rpm, http Fri Jul 15 15:53:01 2016 250041 usec
 test-owner-rpm, http Fri Jul 15 15:54:01 2016 260565 usec
 test-owner-rpm, http Fri Jul 15 15:55:01 2016 270286 usec
 test-owner-rpm, http Fri Jul 15 15:56:01 2016 260264 usec
 test-owner-rpm, http Fri Jul 15 15:58:01 2016 319523 usec
 test-owner-rpm, http Fri Jul 15 15:59:01 2016 260195 usec
 test-owner-rpm, http Fri Jul 15 16:00:01 2016 251111 usec
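
If you only have the CLI output to work with, the table above is easy enough to pick apart. The sketch below parses lines in that format with a regular expression and converts the round trip time from microseconds to milliseconds. Note this is a swapped-in technique: the collection script used later talks NETCONF rather than scraping text, parse_history is a hypothetical helper name, and the regex assumes the column layout shown above.

```python
import re
from datetime import datetime

# One history row: "owner, test  <weekday mon day hh:mm:ss year>  <rtt> usec"
LINE_RE = re.compile(
    r"^\s*(?P<owner>[\w-]+),\s+(?P<test>[\w-]+)\s+"
    r"(?P<when>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"(?P<rtt>\d+) usec\s*$"
)

SAMPLE = """
 Owner, Test Probe received Round trip time
 dns-rpm, ping Fri Jul 15 15:51:01 2016 100218 usec
 test-owner-rpm, http Fri Jul 15 15:49:01 2016 250198 usec
"""


def parse_history(text):
    """Return a list of dicts, one per probe result line; other lines are skipped."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # header, blank line, or anything unexpected
        # Collapse repeated spaces (single-digit days are padded) before parsing.
        stamp = " ".join(match.group("when").split())
        rtt_usec = int(match.group("rtt"))
        records.append({
            "owner": match.group("owner"),
            "test": match.group("test"),
            "when": datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
            "rtt_usec": rtt_usec,
            "rtt_ms": rtt_usec / 1000.0,
        })
    return records


for record in parse_history(SAMPLE):
    print(record["owner"], record["test"], record["when"], record["rtt_ms"])
```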

Ok. Let’s summarize where we’re at.

  • I can reach my Grafana/Influx server on my vboxnet0 virtual network.
  • I can reach the internet as validated by ping
  • My configuration has successfully been provisioned by Ansible to my vSRX
  • RPM is busy polling every minute and logging this data to a history table

Let’s see if our other services are working as expected.

If we’ve done everything right, port 8083 should be forwarded from our localhost to our Ubuntu server, on which InfluxDB should be listening. Let’s check.

Screen Shot 2016-07-15 at 4.08.20 PM

This looks good so far. How about Grafana?

Screen Shot 2016-07-15 at 4.10.04 PM

Also looks good.

Now we just need some data to look at, so let’s go and collect some. For that we’ll need to kick off our netconf-poll.py script to log into the vSRX, grab our RPM data and insert it into our time series database. It’s worth noting the data flow here, and this is intentional.

  1. My Python collection script is to be run locally from my MacBook. This is typically where I’d run my scripts from if I were polling some other network device.
  2. The server is accessible via vboxnet0 (which is the default in most VirtualBox installs I’ve seen), and we’ll be collecting the RPM data and posting it via REST to InfluxDB hosted on the Ubuntu server. This is simply another attached network to your laptop.
  3. The vSRX is also accessible via vboxnet0. It’s polling the internet via one interface (ge-0/0/0, which is bridged to your external NIC), and we’re attaching to it on the internal NIC and pulling off RPM data on its second (ge-0/0/1) interface.

You can confirm all this as well from the VirtualBox GUI as noted here.

Screen Shot 2016-07-15 at 4.18.37 PM.png

Let’s kick off our script. Since this is a demo, we’re hardcoded to attach to 192.168.56.107, but this could easily be modified to reflect whatever box you want to hit, or you could load up argparse or similar to pass those through to the script directly.
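
For what it’s worth, the argparse route might look something like the sketch below. The option names (--host, --interval) are hypothetical and not part of netconf-poll.py as it stands; the defaults just mirror the hardcoded address above and the 600 second sleep the script uses between polls.

```python
import argparse


def build_parser():
    """Command-line options for a polling script (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Poll Junos RPM history and write it to InfluxDB")
    parser.add_argument("--host", default="192.168.56.107",
                        help="device to poll (default: the demo vSRX)")
    parser.add_argument("--interval", type=int, default=600,
                        help="seconds to sleep between polls")
    return parser


# Passing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--host", "10.0.0.1"])
print(args.host, args.interval)
```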

barnesry-mbp:Junos_RPM barnesry$ ./netconf-poll.py
2016-07-15 16:24:47,282 - INFO - Connected (version 2.0, client OpenSSH_6.6)
2016-07-15 16:24:48,373 - INFO - Authentication (publickey) failed.
2016-07-15 16:24:48,422 - INFO - Authentication (password) successful!
2016-07-15 16:24:48,860 - INFO - initialized: session-id=1242 | server_capabilities=['http://xml.juniper.net/dmi/system/1.0', 'urn:ietf:params:xml:ns:netconf:capability:confirmed-commit:1.0', 'http://xml.juniper.net/netconf/junos/1.0', 'urn:ietf:params:xml:ns:netconf:capability:validate:1.0', 'urn:ietf:params:xml:ns:netconf:capability:candidate:1.0', 'urn:ietf:params:xml:ns:netconf:capability:url:1.0?protocol=http,ftp,file', 'urn:ietf:params:xml:ns:netconf:base:1.0']
2016-07-15 16:24:48,869 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,055 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,178 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,306 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,434 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,616 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,740 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,865 - INFO - Connected to vsrx.thelab.net, FIREFLY-PERIMETER running 12.1X47-D15.4
2016-07-15 16:24:49,866 - INFO - Connected to InfluxDB
2016-07-15 16:24:49,880 - INFO - Starting new HTTP connection (1): localhost
2016-07-15 16:24:49,888 - INFO - Staring metrics collection...
2016-07-15 16:24:49,888 - INFO - Requesting 'ExecuteRpc'
<type 'list'>
2016-07-15 16:24:50,017 - WARNING - Element rtt not returned
2016-07-15 16:24:50,019 - INFO - Resetting dropped connection: localhost
2016-07-15 16:24:50,044 - WARNING - Element rtt not returned
2016-07-15 16:24:50,047 - WARNING - Element rtt not returned
2016-07-15 16:24:50,049 - WARNING - Element rtt not returned
2016-07-15 16:24:50,051 - WARNING - Element rtt not returned
2016-07-15 16:24:50,053 - WARNING - Element rtt not returned
2016-07-15 16:24:50,056 - INFO - Sleeping for 600 seconds
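
For a sense of what ends up in the database on each pass, here’s a sketch that turns one RPM result into an InfluxDB line protocol string. The measurement name rpm_history, the owner tag and the value field match the queries used later in this post; the extra test tag, the helper names, and treating the device clock as UTC are my assumptions, and the real script may well build its points through a client library instead.

```python
import calendar
from datetime import datetime


def escape_tag(value):
    """Escape the characters line protocol treats as special in tag values."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def to_line(owner, test, rtt_usec, when_utc):
    """Build one line protocol entry with a nanosecond epoch timestamp."""
    # Assumption: when_utc is a naive datetime already expressed in UTC.
    ts_ns = calendar.timegm(when_utc.timetuple()) * 10**9
    return "rpm_history,owner={},test={} value={} {}".format(
        escape_tag(owner), escape_tag(test), float(rtt_usec), ts_ns)


line = to_line("dns-rpm", "ping", 100218, datetime(2016, 7, 15, 15, 51, 1))
print(line)
```

In the InfluxDB 0.9/1.x HTTP API, lines like this can be POSTed to the /write endpoint with the db parameter set to the target database.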

Screen Shot 2016-07-15 at 4.25.40 PM

So, did this work? Let’s check. You’ll notice if you refresh your Influx screen, where previously _internal was the only database available, there should now be one called network. This was created on first run by the script.

First, select the network database in the upper right corner, then let’s check which measurements were inserted.

Screen Shot 2016-07-15 at 4.29.18 PM

OK, so we have something here. Let’s dig a little further.

Screen Shot 2016-07-15 at 4.30.41 PM

Excellent, we have data. Now we need to tell Grafana where to pull its metrics from so it can graph them, so we’ll point it at InfluxDB. I was hoping to automate all this manual config, but never quite made it there…TBD 🙂

Screen Shot 2016-07-15 at 4.33.21 PM

The default Grafana login is admin/admin if you’re wondering. Once logged in, let’s navigate to Data Sources/Add Data Source.

Screen Shot 2016-07-15 at 4.38.01 PM

You can plug in data similar to this screenshot, using the default root:root auth for Influx (don’t do this in production!). Then Save & Test.

Screen Shot 2016-07-15 at 4.41.39 PM

Hopefully when you saved, it was successful. We can now navigate back to our main dashboard page, and create a New dashboard. For Panel Data Source select ‘rpm_data’. If you don’t see this option, Grafana may not have made a successful connection to InfluxDB, so there might be some additional troubleshooting to do there.

This should open up a SELECT area where you can enter in your queries. I fought with this for a bit, but there’s a small button on the right side for “Toggle Edit Mode” where you can paste your query right in rather than fighting with the field/form entry. I’ve included the queries here for easy copy/paste.

 

SELECT value/1000 FROM "rpm_history" WHERE owner='test-owner-rpm'

SELECT value/1000 FROM "rpm_history" WHERE owner='dns-rpm'
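
If you’d rather sanity-check a query outside Grafana first, InfluxDB 0.9/1.x also answers queries over HTTP on its /query endpoint. The sketch below only builds the URL; it assumes the API is reachable on localhost:8086 (the port forwarded by Vagrant) and that the database is called network, and influx_query_url is a made-up helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def influx_query_url(query, db="network", host="localhost", port=8086):
    """Return a GET URL for InfluxDB's /query endpoint with the query URL-encoded."""
    return "http://{}:{}/query?{}".format(
        host, port, urlencode({"db": db, "q": query}))


query = "SELECT value/1000 FROM \"rpm_history\" WHERE owner='dns-rpm'"
url = influx_query_url(query)
print(url)

# Decoding the URL again shows the query survives the round trip intact.
params = parse_qs(urlsplit(url).query)
print(params["q"][0])
```

Dividing by 1000 in the query is what turns the microsecond values RPM reports into milliseconds on the graph.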

Screen Shot 2016-07-15 at 4.44.28 PM

You should end up with something that looks similar to this. Make sure you click SAVE at the top, or you’ll lose your changes. There’s also a typo in my screenshot w/r/t owner – it needs to match the data in the DB, and I had changed something somewhere…

Screen Shot 2016-07-15 at 4.45.43 PM

View

I had to adjust my time (zoom out) to find my data points as my times aren’t sync’d between my router/server/database but there they are. I can now track my HTTP/ICMP RTT times in a nice graphical interface.

Screen Shot 2016-07-15 at 4.48.23 PM

This just begins to scratch the surface of what’s possible, but it’s a good way to get the juices flowing when it comes to the types of metrics you’d like to get from your network.

 


Where exactly are Pokemon Go servers hosted?

I’ll admit. I don’t really follow video games that closely. Mostly owing to the fact that I’m frankly too busy with other things to fill my time with them these days. It’s not that I shun them for any particular reason. I used to be a fairly big gamer, but those days are long gone.

So then, I was surprised this morning by a bunch of articles on the (un)availability of the services tasked with keeping hordes of Pokemon Go players happy this past week.

http://www.gamespot.com/articles/pokemon-gos-international-rollout-paused-as-server/1100-6441650/

I have a 6-year-old daughter who is infatuated with Pokemon these days thanks to Netflix, but I honestly don’t have a clue what they are, or why anyone would like them. I’ll let you come to your own opinions on the merits of the game, but it does have a pretty interesting history for us technical folks. To save you some reading, Niantic is a startup spun out of Google in 2015, birthed from John Hanke, father of Keyhole, which was acquired by Google and eventually became Google Earth.

During a work conversation today, a question came up that I thought for sure would be easy to answer. If Niantic’s services are being crushed by load, where are they hosted? I figured a quick Google search would unearth the answer to this query, but alas my Google-fu was not up to the task. So let’s dig in, shall we?

Pokemon Go has recently been released on Android and iOS. I’m an iOS user, so it’s not quite as easy as simply running a tcpdump on my laptop while running the app and finding the source (unless you’re jailbroken and whatnot). There will be a myriad of ways to do this by the way, but I thought documenting one of the ways to grab this data might be interesting, as I learned a few things along the way as well, so why not share it.


First things first. Let’s grab a .pcap of the data coming off my iOS device so we can analyse our outbound traffic and find out where it’s headed.

A quick check with tcpdump on the en0 interface (my wireless NIC) was turning up only broadcast/multicast traffic, not the traffic I expected to see in flight in the air. After all, wireless IS shared media, right?

barnesry-mbp:~ barnesry$ ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
 ether 60:f8:1d:c4:3b:38
 inet6 fe80::62f8:1dff:fec4:3b38%en0 prefixlen 64 scopeid 0x4
 inet 10.0.1.10 netmask 0xffffff00 broadcast 10.0.1.255
 nd6 options=1<PERFORMNUD>
 media: autoselect
 status: active

barnesry-mbp:$ sudo tcpdump -i en0 host 10.0.1.3
Password:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type EN10MB (Ethernet), capture size 65535 bytes
23:16:16.338903 IP 10.0.1.3.mdns > 224.0.0.251.mdns: 0 [2q] [1au] PTR (QU)? _airplay._tcp.local. PTR (QU)? _raop._tcp.local. (78)
23:16:16.645533 ARP, Request who-has 10.0.1.3 tell 10.0.1.13, length 28

What then to do?

If you want to sniff all the wireless traffic around, you’ll need to drop your wireless card into ‘monitor’ mode. This is also outlined in the tcpdump man page in OSX and will result in passing up everything from the 802.11 layer, including wireless SNR etc., which makes the capture much more interesting!

-I Put the interface in “monitor mode”; this is supported only on IEEE 802.11
Wi-Fi interfaces, and supported only on some operating systems.

barnesry-mbp:~ barnesry$ sudo tcpdump -I -i en0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type IEEE802_11_RADIO (802.11 plus radiotap header), capture size 65535 bytes
23:09:30.065941 4111060402us tsft 1.0 Mb/s 2452 MHz 11g -75dB signal -92dB noise antenna 0 Beacon (CenturyLink6989) [1.0* 2.0* 5.5* 11.0* 18.0 24.0 36.0 54.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.075347 4111071507us tsft 1.0 Mb/s 2452 MHz 11g -92dB noise antenna 0
23:09:30.075625 4111071935us tsft 1.0 Mb/s 2452 MHz 11g -92dB noise antenna 0 Acknowledgment RA:b8:c7:5d:12:8e:f5 (oui Unknown)
23:09:30.112349 4111106996us tsft 1.0 Mb/s 2452 MHz 11g -92dB noise antenna 0 Beacon (Goiter) [1.0* 2.0* 5.5* 11.0* 6.0 9.0 12.0 18.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.168411 4111162822us tsft 1.0 Mb/s 2452 MHz 11g -76dB signal -92dB noise antenna 0 Beacon (CenturyLink6989) [1.0* 2.0* 5.5* 11.0* 18.0 24.0 36.0 54.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.214709 4111209396us tsft 1.0 Mb/s 2452 MHz 11g -38dB signal -92dB noise antenna 0 Beacon (Goiter) [1.0* 2.0* 5.5* 11.0* 6.0 9.0 12.0 18.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.270764 4111265204us tsft 1.0 Mb/s 2452 MHz 11g -73dB signal -92dB noise antenna 0 Beacon (CenturyLink6989) [1.0* 2.0* 5.5* 11.0* 18.0 24.0 36.0 54.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.284515 4111280917us tsft short preamble 24.0 Mb/s 2452 MHz 11g -41dB signal -92dB noise antenna 0 Clear-To-Send RA:60:f8:1d:c4:3b:38 (oui Unknown)

What I also didn’t know is there’s a built in wireless diagnostic tool in OSX which can help with this and the process is documented HERE. This will result in a .wpcap file dropped on your desktop containing all wireless traffic snorted off your wifi card. Cool.

I closed all the applications on my phone, and turned it off prior to starting the capture. We will need the initial handshake captured when the phone associates to the AP in order to decode the traffic later.

So we start the wireless capture (and my timer) and turn the phone on.

Screen Shot 2016-07-08 at 11.33.29 PM

Within about 30 seconds, the phone is on. It’s likely there will be a bunch of stuff happening on boot, so we’ll wait a couple of minutes for everything to settle down before launching anything, which will make triangulation of our game traffic much easier. Then we’ll log in, move around a bit, then close down the capture.

 

Let’s load that resulting .wpcap into Wireshark for a look.

This is interesting… we’ve got data, but we can’t read much of it. That’s because it’s encrypted with WPA2-PSK (PSK=pre-shared-key) or WPA2-Personal if you’re my Airport Extreme AP. So we’ll need to decrypt the data to get anything useful here. We’ll do that in a bit.

Screen Shot 2016-07-08 at 11.37.49 PM

First, let’s narrow our search down a bit, we’ll need the MAC address of my iPhone. We can get this by navigating to Settings–>General–>About–>Wifi Address on the phone. In my case the MAC address ends in 14:18 which should be good enough to filter on.

Rather than pick through the capture looking for this, or using a display filter let’s use some of the built in tools Wireshark provides to help us.

Navigate to Statistics–>WLAN Traffic.

This will open a window similar to below. I filtered on Ch. 9 to find my AP (Goiter – I name my APs after human afflictions) which then displays the packet statistics by MAC address for my AP. Cool right? Remember the last two octets 14:18 from my phone? It’s responsible for 31.56% of the traffic on this AP.

Screen Shot 2016-07-08 at 11.44.30 PM

Now – to filter traffic to ONLY my iPhone, let’s apply a filter to the capture. Right click on the MAC, and choose “Apply As Filter”

Screen Shot 2016-07-08 at 11.48.06 PM

Now we’re only looking at filtered data specific to my phone, but now we need to decode it. You can do this as outlined in the Wireshark Wiki (since you already know your WPA passphrase, right?)

In my case, I chose Edit–>Preferences, then selected Protocols and browsed down to IEEE 802.11. From here, ensure Enable Decryption is checked, and Edit your keys.

Screen Shot 2016-07-08 at 11.52.04 PM

Here I used wpa-pwd and simply entered my SSID passphrase in cleartext (I erased it for this screenshot).

Alternatively, I believe you can also select wpa-psk but you’ll need to enter in the full 64 digit hex key, as outlined here. The cleartext version worked for me so I didn’t bother with the latter.
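
For reference, that 64 digit hex key is derived from the passphrase and the SSID using PBKDF2 with HMAC-SHA1, 4096 iterations and a 32 byte output, which is how WPA2-Personal defines the PSK. A small sketch using only the standard library, with an obviously made-up passphrase and a hypothetical helper name:

```python
import hashlib


def wpa_psk(passphrase, ssid):
    """Derive the 256-bit WPA/WPA2 pre-shared key as 64 hex digits."""
    key = hashlib.pbkdf2_hmac(
        "sha1",                      # HMAC-SHA1 as the pseudo-random function
        passphrase.encode("utf-8"),  # the passphrase is the PBKDF2 password
        ssid.encode("utf-8"),        # the SSID acts as the salt
        4096,                        # iteration count
        32,                          # 32 bytes = 256 bits
    )
    return key.hex()


# Placeholder passphrase; substitute your own SSID and passphrase.
print(wpa_psk("not-my-real-passphrase", "Goiter"))
```

Because the SSID is the salt, the same passphrase on a differently named network produces a different key.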

Now, we’ve got a full decrypted traffic stream we can work with. You’ll also notice in the bottom status bar I’ve captured 350k odd packets, and I’m displaying only 70k of them (my iPhone), so let’s chuck the rest as we don’t need to work with such a big file. Choose File–>Export Specified Packets, and write our file out containing only the Displayed packets we’re interested in. Then re-open our smaller (more specific) file.

Analysis

Time for some analysis. Choose Statistics–>Conversations. This will net us a table of each pair of IP endpoints, so we can get a better view of how much data was flowing between pairs of machines during the capture. I found sorting by Rel Start to be useful, as I could match the appearance of certain IP addresses against my testing timeline. (Remember I waited for 120 or so seconds to start my test?)

Screen Shot 2016-07-09 at 12.07.22 AM

Excellent. There’s a notable gap in activity between 75s and 171s, so let’s start there and assume we probably kicked things off around 171 seconds into the capture.
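Eyeballing the gap works fine for a capture this size, but the same idea is easy to script if you export the Rel Start column. This is a minimal sketch, assuming you have the start times as a plain list of floats. The function name and the sample values are illustrative; only the 75s and 171s boundaries come from my capture.

```python
from typing import List, Tuple


def largest_gap(start_times: List[float]) -> Tuple[float, float]:
    """Return (gap_start, gap_end) for the widest quiet period between starts."""
    times = sorted(start_times)
    if len(times) < 2:
        raise ValueError("need at least two timestamps")
    # Pair each timestamp with its successor and keep the widest pair.
    return max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])


if __name__ == "__main__":
    # Illustrative Rel Start values in seconds.
    rel_starts = [0.0, 12.5, 75.0, 171.0, 172.3, 209.0]
    print(largest_gap(rel_starts))
```

Keep in mind Rel Start only marks when a conversation began, so a long-lived conversation could still be active during the "quiet" window.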

For a high-level view of what’s going on, DNS is a good place to start, so let’s filter for that in the display filter using dns.

 

Screen Shot 2016-07-09 at 12.18.28 AM

Some interesting stuff to glean here. It looks like our first call is at 171s into the capture as well, attempting to resolve pgorelease.nianticlabs.com. Bingo.

This very likely confirms our relative start time. There are some other queries for upsight and kontagent, which a quick Google search reveals are mobile analytics companies, likely in play here providing data back to the publisher. Next we have accounts.google.com, which is also explainable as I logged in using my Google account. The interesting stuff ends around 209s when I snapped a picture on my phone, and it immediately went to upload it to iCloud. More on that in a bit. 🙂

Here are the key IP pairs (I think) we should be interested in.

  1. storage.googleapis.com -> which resolves to 216.58.217.48 (whois: 1e100.net, aka Google CDN)
  2. cl3.apple.com.edgekey.net -> which CNAMEs to the Akamai CDN

Screen Shot 2016-07-09 at 12.27.39 AM

Let’s see if our traffic profiles confirm some of this. Navigate to Statistics–>Conversations. I’ve selected IPv4 here, and sorted by Rel Start which should allow us to rule out any traffic occurring before relative timestamp 171sec.

Screen Shot 2016-07-09 at 12.31.06 AM

Since our game should be pretty chatty over time as we load the game, report location, and receive game status, I’d expect a high packet count but perhaps a low byte count (small packets, but lots of them). We’ve actually got four good candidates here.

  1. 216.58.217.48 – Google 1e100.net (Denver)
  2. 216.58.216.129 – Google 1e100.net (Seattle)
  3. 130.211.188.132 – googleusercontent.com
  4. 104.6.203.30 – Akamai via cl3.apple.com.edgekey.net
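The "lots of small packets" heuristic I used to pick these out can be written down as a filter over the Conversations table. The sketch below is only one way to express it. The Conversation class, the thresholds, and the sample rows (which use documentation-range IPs) are all hypothetical and are not figures from my capture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conversation:
    peer: str          # remote IP address
    packets: int       # total packets in both directions
    total_bytes: int   # total bytes in both directions
    rel_start: float   # seconds from the start of the capture


def chatty_candidates(
    convs: List[Conversation],
    start_after: float,
    min_packets: int,
    max_avg_size: float,
) -> List[Conversation]:
    """Keep conversations that start late enough, have many packets, and small ones."""
    picked = [
        c
        for c in convs
        if c.rel_start >= start_after
        and c.packets > 0
        and c.packets >= min_packets
        and c.total_bytes / c.packets <= max_avg_size
    ]
    # Most chatty first.
    return sorted(picked, key=lambda c: c.packets, reverse=True)


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    table = [
        Conversation("192.0.2.10", 400, 90_000, 172.0),
        Conversation("192.0.2.20", 300, 60_000, 180.5),
        Conversation("192.0.2.30", 50, 4_000_000, 225.6),  # bulk transfer
        Conversation("192.0.2.40", 500, 100_000, 20.0),    # before the test began
    ]
    for c in chatty_candidates(table, start_after=171.0, min_packets=100, max_avg_size=500):
        print(c.peer, c.packets, round(c.total_bytes / c.packets))
```

The thresholds are a judgment call. I would start loose and tighten them until only a handful of peers remain.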

 

To resolve IP addresses to owners as I’ve done above, we’ll need to perform some whois lookups to see who owns this IP space.

For this I personally like to use network-tools.com, which has a good suite of tools available in a consumable format. You can also use your trusty command line tools to accomplish this. The example below came from kontagent.net, who apparently host with SoftLayer.

barnesry-mbp:~ barnesry$ whois -h whois.radb.net 50.23.68.199
route: 50.23.64.0/18
descr: customer Alestra
origin: AS32098
mnt-by: MAINT-AS32098
changed: mmg@transtelco.net 20150618 #18:44:54Z
source: RADB

route: 50.23.64.0/18
descr: auto-generated route object for 50.23.64.0/18
origin: AS11172
mnt-by: MAINT-AS11172
changed: om_core@alestra.com.mx 20141031 #09:44:53Z
source: RADB

route: 50.23.64.0/18
descr: SOFTLAYER-sjc01
origin: AS36351
notify: noc@softlayer.com
mnt-by: MAINT-AS36351
changed: ipadmin@softlayer.com 20110104
source: RADB

route: 50.23.64.0/18
descr: REACH (Customer Route)
tech-c: RRNOC1-REACH
origin: AS36351
remarks: This auto-generated route object was created
remarks: for a REACH customer route
remarks:
remarks: This route object was created because
remarks: some REACH peers filter based on these objects
remarks: and this route may be rejected
remarks: if this object is not created.
remarks:
remarks: Please contact irr@team.telstra.com if you have any
remarks: questions regarding this object.
notify: irr@team.telstra.com
mnt-by: MAINT-REACH-NOC
changed: irr@team.telstra.com 20140206
source: REACH
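That is a lot of output to read for one IP. If you are doing this for more than a handful of addresses, it helps to boil each route object down to its origin AS and description. Here is a minimal Python sketch that parses the "key: value" text above into dictionaries. The function name is mine, and the sample is a trimmed copy of two of the objects shown above.

```python
from typing import Dict, List

# Trimmed copy of two route objects from the whois output above.
SAMPLE = """\
route: 50.23.64.0/18
descr: customer Alestra
origin: AS32098
mnt-by: MAINT-AS32098
source: RADB

route: 50.23.64.0/18
descr: SOFTLAYER-sjc01
origin: AS36351
mnt-by: MAINT-AS36351
source: RADB
"""


def parse_route_objects(text: str) -> List[Dict[str, str]]:
    """Split whois output into objects on blank lines, one dict per object."""
    objects: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current object.
            if current:
                objects.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        # Keep only the first value for repeated keys such as "remarks".
        current.setdefault(key.strip(), value.strip())
    if current:
        objects.append(current)
    return objects


if __name__ == "__main__":
    for obj in parse_route_objects(SAMPLE):
        print(obj.get("route"), obj.get("origin"), obj.get("descr"))
```

Multiple route objects for the same prefix with different origins, as seen here, are common in routing registries, so treat the result as a hint about ownership rather than proof.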

Since Niantic spun out of Google in 2015, it’s a pretty good guess they’re hosted in Google Compute Engine, of which their two IPs listed above account for ~700 packets (or about half of the traffic from the big four listed earlier).

I think we can further discount the IP address 104.6.203.30 as some background Apple stuff hosted in Akamai perhaps related to game launch (validation and such). This leaves… Google Compute Engine.

We can also rule out the large byte transfer at Rel Start 225.647351, as a deep dive on that lines up with our DNS query for an Amazon AWS Oregon S3 bucket, preceded by some chatter to Apple CDN servers. This is further confirmed by a whois on 54.231.163.58 (AWS-S3) and also by the upload ratio for this conversation… stuff going up, up, up into the cloud when I snapped my picture on my iPhone.

Screen Shot 2016-07-09 at 12.37.22 AM
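By "upload ratio" I just mean the share of a conversation's bytes that left my phone. A tiny Python sketch of that calculation is below. The function name and the byte counts are made up for illustration, and which of Wireshark's two directional byte columns counts as "up" depends on which side of the conversation the phone is listed on.

```python
def upload_ratio(bytes_up: int, bytes_down: int) -> float:
    """Fraction of a conversation's bytes sent from the client (0.0 to 1.0)."""
    total = bytes_up + bytes_down
    if total == 0:
        return 0.0  # empty conversation, nothing to compare
    return bytes_up / total


if __name__ == "__main__":
    # Hypothetical byte counts: a photo upload is mostly outbound.
    print(f"{upload_ratio(9_000_000, 300_000):.0%} of bytes went up")
```

A ratio close to 1.0 points at an upload such as a photo sync, while game traffic would be expected to sit closer to the middle.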

Summary

The intent of this post was initially to answer my earlier question of where Niantic was hosting Pokemon Go, which perhaps could more easily have been assumed simply by looking at where Niantic was birthed (Google), but where is the fun in that?

The end result is less a proof of this fact and more a view into the various troubleshooting tools and methodologies available for a real deep dive into network traffic patterns on the internet.

If you’ve gotten this far, hopefully it was an interesting enough read.