Monitoring Juniper RPM data with InfluxDB and Grafana

I’ve been tinkering with this for some time now, more as a personal interest project to present visual data, but it morphed a bit into an automation project as well which was interesting.

When running a site-to-site IPSec VPN, it’s critically important to measure the performance of each connection available at the site. Typically, as a bare minimum these days, one might expect a primary broadband cable/DSL line with a backup dial/3G/4G connection.

Now, this is really the data-gathering portion (a subset) of what SD-WAN is driving towards: measure your available links, and select the best link based on a set of pre-defined criteria. These criteria may vary by deployment based on the requirements of the underlying applications in use.
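As a rough illustration of that selection step, here is a minimal Python sketch. The link names, the metrics, and the thresholds are all hypothetical, and a real SD-WAN product weighs far more than this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkStats:
    """Measured health of one WAN link (all values are illustrative)."""
    name: str
    rtt_ms: float     # average round-trip time
    loss_pct: float   # packet loss percentage


def select_best_link(links: List[LinkStats],
                     max_loss_pct: float = 2.0,
                     max_rtt_ms: float = 300.0) -> Optional[LinkStats]:
    """Return the lowest-RTT link that meets the criteria, or None."""
    eligible = [link for link in links
                if link.loss_pct <= max_loss_pct and link.rtt_ms <= max_rtt_ms]
    return min(eligible, key=lambda link: link.rtt_ms) if eligible else None


if __name__ == "__main__":
    links = [
        LinkStats("cable", rtt_ms=35.0, loss_pct=5.0),  # fast but lossy
        LinkStats("lte", rtt_ms=90.0, loss_pct=0.5),
    ]
    best = select_best_link(links)
    print(best.name if best else "no link meets the criteria")
```

The criteria are plain function arguments on purpose: a voice deployment might tighten the loss threshold, while a backup job might only care that the link is up.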

With regards to available tools we can use for measurement on the Juniper SRX (today), let’s look at implementing RPM at each branch site. And before you get all giddy and wonder if you can perhaps configure a bunch of monitors at your headend devices hitting all your remote branches, let it be known you are limited to only 500 probes on a single device. This appears to be the case even on high end SRX. There is also a great degree of concern as to the accuracy of tests performed on the SRX as it’s a CPU bound device, meaning all the packet forwarding happens in the CPU. We typically will see a dedicated core for control-plane functions and a separate core for data-plane functions, but there is valid concern that a suddenly busy control plane (say, converging BGP and whatnot) might become busy enough to lazily respond to aggressive polling, skewing your results.

First things first. Let’s implement a couple of very simple RPM probes on our branch SRX.

I’m going to attempt to automate most of the delivery of this demo using Vagrant, vSRX, and an installation of Ubuntu such that a simple “vagrant up” will stand up much of this environment for you. For free. 😀

Most of the details of this setup are located here.

https://github.com/barnesry/Junos_RPM

But I’ll summarize the general steps required to make most of this work. Keep in mind these are specific to launching on a (read: my) MacBook, so your results may vary on other platforms.

Requirements

  1. Vagrant 1.8.1 (INSTALL)
  2. Ansible 2.0.1.0 (INSTALL)
  3. Juniper / Ansible Plugin (basically a PyEz port for Ansible)
    barnesry@barnesry-ubuntu16:~/Ansible$ sudo ansible-galaxy install Juniper.junos
    - downloading role 'junos', owned by Juniper
    - downloading role from https://github.com/Juniper/ansible-junos-stdlib/archive/1.3.1.tar.gz
    - extracting Juniper.junos to /etc/ansible/roles/Juniper.junos
    - Juniper.junos was installed successfully
  4. Virtualbox 5.0.20 (INSTALL)
  5. Git (either command line or desktop to clone into)

We’ll use Virtualbox to spin up our Ubuntu VM, and our vSRX VM on our host machine and Vagrant to handle the automation of this as Vagrant is really a nice CLI wrapper around vboxmanage so you can configure all your settings in a file and launch systems rather than having to click around the UI each time.

Vagrant supports a concept called a provisioner whereby once a VM has been launched we can kick off an Ansible playbook and push a configuration generated dynamically at launch. We’ll also need the Juniper/Ansible plugin to allow Ansible to call specific functions against our Juniper equipment. This is really just a repackaging of the more commonly used Python PyEZ library used for interacting with the device API via XML RPC. You can pull the PyEZ library itself for Python using the following command:

pip install junos-eznc

Launch

  1. Find a directory in which to clone the github repository and change into that directory.
  2. git clone https://github.com/barnesry/Junos_RPM.git
  3. vagrant up

This should, and I’ll use the word should carefully and even italicize it for extra effect, download a copy of Ubuntu via the config.vm.box = "ubuntu/trusty64" line in the Vagrantfile and attempt to install and launch both Grafana and InfluxDB with default settings.

Vagrant will also attempt to download a copy of vSRX already configured in packet mode (i.e. not in FW flow mode) from Vagrant Atlas via the vsrx.vm.box = 'juniper/ffp-12.1X47-D15.4-packetmode' line in the Vagrantfile. I cannot take credit for the Atlas package; I’m just using it.

Once launched, the vSRX should come up with ge-0/0/0 bridged to an external public NIC (mine is my WiFi card), and a ge-0/0/1 configured to connect to a virtual local network which is typically the host-only adapter called vboxnet0, which it happens to share with our newly launched Ubuntu server as well. FYI. The default login for this vSRX “box” is user: root pass: Juniper.

In addition, on launch you should also be asked which NIC you’d like to bridge. This will provide the internet connectivity for your VM to connect out and pull down its required packages, so choose one that provides your local machine the interwebs.

barnesry-mbp:Junos_RPM barnesry$ vagrant up
Bringing machine 'ubuntu-monitoring' up with 'virtualbox' provider...
Bringing machine 'vsrx' up with 'virtualbox' provider...
==> ubuntu-monitoring: Importing base box 'ubuntu/trusty64'...
==> ubuntu-monitoring: Matching MAC address for NAT networking...
==> ubuntu-monitoring: Checking if box 'ubuntu/trusty64' is up to date...
==> ubuntu-monitoring: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> ubuntu-monitoring: have version '20160512.0.0'. The latest is version '20160708.1.2'. Run
==> ubuntu-monitoring: `vagrant box update` to update.
==> ubuntu-monitoring: Setting the name of the VM: Junos_RPM_ubuntu-monitoring_1468616834532_36360
==> ubuntu-monitoring: Clearing any previously set forwarded ports...
==> ubuntu-monitoring: Clearing any previously set network interfaces...
==> ubuntu-monitoring: Available bridged network interfaces:
1) en0: Wi-Fi (AirPort)
2) en1: Thunderbolt 1
3) en2: Thunderbolt 2
4) p2p0
5) awdl0
6) bridge0
7) vmnet1
8) vmnet4
9) vmnet5
10) vmnet7
11) vmnet8
12) en7: USB Ethernet
==> ubuntu-monitoring: When choosing an interface, it is usually the one that is
==> ubuntu-monitoring: being used to connect to the internet.
 ubuntu-monitoring: Which interface should the network bridge to? 1
==> ubuntu-monitoring: Preparing network interfaces based on configuration...
 ubuntu-monitoring: Adapter 1: nat
 ubuntu-monitoring: Adapter 2: intnet
 ubuntu-monitoring: Adapter 3: bridged
==> ubuntu-monitoring: Forwarding ports...
 ubuntu-monitoring: 3000 (guest) => 3000 (host) (adapter 1)
 ubuntu-monitoring: 8080 (guest) => 8080 (host) (adapter 1)
 ubuntu-monitoring: 8083 (guest) => 8083 (host) (adapter 1)
 ubuntu-monitoring: 8086 (guest) => 8086 (host) (adapter 1)
 ubuntu-monitoring: 22 (guest) => 2222 (host) (adapter 1)
==> ubuntu-monitoring: Running 'pre-boot' VM customizations...
==> ubuntu-monitoring: Booting VM...
==> ubuntu-monitoring: Waiting for machine to boot. This may take a few minutes...
 ubuntu-monitoring: SSH address: 127.0.0.1:2222
 ubuntu-monitoring: SSH username: vagrant
 ubuntu-monitoring: SSH auth method: private key
 ubuntu-monitoring: Warning: Remote connection disconnect. Retrying...
 ubuntu-monitoring: Warning: Remote connection disconnect. Retrying...
==> ubuntu-monitoring: Machine booted and ready!
==> ubuntu-monitoring: Running provisioner: shell...
 ubuntu-monitoring: Running: inline script
==> ubuntu-monitoring: stdin: is not a tty
==> ubuntu-monitoring: Get:1 http://security.ubuntu.com trusty-security InRelease [65.9 kB]
==> ubuntu-monitoring: Ign http://archive.ubuntu.com trusty InRelease
==> ubuntu-monitoring: Get:2 http://archive.ubuntu.com trusty-updates InRelease [65.9 kB]
==> ubuntu-monitoring: Get:3 http://security.ubuntu.com trusty-security/main Sources [118 kB]
==> ubuntu-monitoring: Get:4 http://archive.ubuntu.com trusty-backports InRelease [65.9 kB]

<...snip...all the install stuff happens here ...snip...>

==> ubuntu-monitoring: * Starting Grafana Server
==> ubuntu-monitoring: ...done.
==> ubuntu-monitoring: Processing triggers for ureadahead (0.100.0-16) ...
==> ubuntu-monitoring: Starting the process influxdb [ OK ]
==> ubuntu-monitoring: influxdb process was started [ OK ]
==> ubuntu-monitoring: * Starting Grafana Server
==> ubuntu-monitoring: * Already running.
==> ubuntu-monitoring: ...done.

Next, once the above process has completed, Vagrant will also launch the vSRX followed by the Ansible provisioner, which will merge the template config, push this configuration to the newly launched vSRX device, and start RPM polling to various points on the internet. RPM will store those results in its 1024-position rolling memory table. We’ll poll this table using our Python script on our Ubuntu server in a second.

==> vsrx: Importing base box 'juniper/ffp-12.1X47-D15.4-packetmode'...
==> vsrx: Matching MAC address for NAT networking...
==> vsrx: Checking if box 'juniper/ffp-12.1X47-D15.4-packetmode' is up to date...
==> vsrx: Setting the name of the VM: vagrant-vsrx
==> vsrx: Fixed port collision for 22 => 2222. Now on port 2200.
==> vsrx: Clearing any previously set network interfaces...
==> vsrx: Preparing network interfaces based on configuration...
 vsrx: Adapter 1: nat
 vsrx: Adapter 2: intnet
==> vsrx: Forwarding ports...
 vsrx: 22 (guest) => 2200 (host) (adapter 1)
==> vsrx: Running 'pre-boot' VM customizations...
==> vsrx: Booting VM...
==> vsrx: Waiting for machine to boot. This may take a few minutes...
 vsrx: SSH address: 127.0.0.1:2200
 vsrx: SSH username: vagrant
 vsrx: SSH auth method: private key
 vsrx: Warning: Remote connection disconnect. Retrying...
 vsrx: Warning: Remote connection disconnect. Retrying...
==> vsrx: Machine booted and ready!
==> vsrx: Setting hostname...
==> vsrx: Running provisioner: ansible...
 vsrx: Running ansible-playbook...
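
The template merge that the provisioner performs boils down to filling probe parameters into Junos set commands. The sketch below is not the playbook or template from the repository; it is a stand-alone Python approximation, with a made-up render_rpm_probe helper, that produces the same style of RPM configuration shown later in this post.

```python
def render_rpm_probe(owner, test, probe_type, target,
                     probe_count=5, interval=60, history_size=10):
    """Return Junos 'set' commands for one RPM probe.

    `target` is the full target clause, e.g. 'address 8.8.8.8'
    or 'url http://www.google.com'.
    """
    base = "set services rpm probe {} test {}".format(owner, test)
    settings = [
        "probe-type {}".format(probe_type),
        "target {}".format(target),
        "probe-count {}".format(probe_count),
        "probe-interval {}".format(interval),
        "test-interval {}".format(interval),
        "history-size {}".format(history_size),
    ]
    return ["{} {}".format(base, setting) for setting in settings]


if __name__ == "__main__":
    for line in render_rpm_probe("dns-rpm", "ping", "icmp-ping",
                                 "address 8.8.8.8"):
        print(line)
```

Keeping the probe definition as data (owner, test, target) is what makes it practical to stamp out the same probes on many branch devices.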

If your Ansible fails to connect (I was getting this) complaining about the following, try adding the vagrant insecure key to your ssh-agent

TASK [Deploy config to device ... please wait] *********************************
task path: /Users/barnesry/PycharmProjects/Junos_RPM/provisioning/playbook-deploy-config.yaml:15
ESTABLISH LOCAL CONNECTION FOR USER: barnesry
127.0.0.1 EXEC /bin/sh -c '( umask 22 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311 `" && echo "` echo $HOME/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311 `" )'
127.0.0.1 PUT /var/folders/p3/7cq2wpk943d331zlsr808k2c001364/T/tmpo7LD9L TO /Users/barnesry/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311/junos_install_config
127.0.0.1 EXEC /bin/sh -c 'LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/env python /Users/barnesry/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311/junos_install_config; rm -rf "/Users/barnesry/.ansible/tmp/ansible-tmp-1468620084.17-200326305408311/" > /dev/null 2>&1'
fatal: [vsrx]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_args": {"comment": null, "confirm": null, "console": null, "diffs_file": null, "file": "/tmp/vsrx.conf", "host": "127.0.0.1", "logfile": "/tmp/changes.log", "overwrite": "yes", "passwd": null, "port": 2200, "replace": "no", "savedir": null, "timeout": 0, "user": "vagrant"}, "module_name": "junos_install_config"}, "msg": "unable to connect to 127.0.0.1: ConnectAuthError(127.0.0.1)"}

When I looked in the automatically generated Ansible inventory file, all the parameters looked OK. Namely, I’m seeing ansible_ssh_user passed through as ‘vagrant’, as defined by the vsrx.ssh.username = 'vagrant' line in the Vagrantfile; otherwise it would attempt to log into the device using your current username, which you probably don’t want. I further confirmed it was connecting correctly to localhost:2200, but for some reason it was having issues passing the insecure key found in the .vagrant.d directory.

barnesry-mbp:Junos_RPM barnesry$ cat /Users/barnesry/PycharmProjects/Junos_RPM/.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory
# Generated by Vagrant

ubuntu-monitoring ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_user='vagrant' ansible_ssh_private_key_file='/Users/barnesry/.vagrant.d/insecure_private_key'
vsrx ansible_ssh_host=127.0.0.1 ansible_ssh_port=2200 ansible_ssh_user='vagrant' ansible_ssh_private_key_file='/Users/barnesry/.vagrant.d/insecure_private_key'

[vsrx]

[all:children]
vsrx

Once I loaded this key into my ssh-agent I was able to successfully provision the vSRX. I’ll dig into why this is when I get some more time to troubleshoot.

ssh-add /Users/barnesry/.vagrant.d/insecure_private_key

Now we can try and ssh into our vSRX. Success!

barnesry-mbp:Junos_RPM barnesry$ vagrant ssh vsrx
--- JUNOS 12.1X47-D15.4 built 2014-11-12 02:13:59 UTC
vagrant@vsrx>

You can force reprovisioning of your vSRX (if it failed) by issuing

vagrant provision vsrx

Configure

Hopefully, this last try worked for you. Now let’s confirm our settings have been applied correctly.

barnesry-mbp:Junos_RPM barnesry$ vagrant ssh vsrx
--- JUNOS 12.1X47-D15.4 built 2014-11-12 02:13:59 UTC

vagrant@vsrx> show interfaces terse | match inet
ge-0/0/0.0 up up inet 10.0.2.15/24
sp-0/0/0.0 up up inet
 inet6
sp-0/0/0.16383 up up inet 10.0.0.1 --> 10.0.0.16
ge-0/0/1.0 up up inet 192.168.56.107/24  <-- here's our local mgmt
lo0.0 up up inet 192.168.0.1 --> 0/0
lo0.16384 up up inet 127.0.0.1 --> 0/0
lo0.16385 up up inet 10.0.0.1 --> 0/0

vagrant@vsrx> ping 192.168.56.199  <-- can we ping our server?
PING 192.168.56.199 (192.168.56.199): 56 data bytes
64 bytes from 192.168.56.199: icmp_seq=0 ttl=64 time=19.662 ms
^C
--- 192.168.56.199 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 19.662/19.662/19.662/0.000 ms


vagrant@vsrx> show configuration services | display set
set services rpm probe test-owner-rpm test http probe-type http-get
set services rpm probe test-owner-rpm test http target url http://www.google.com
set services rpm probe test-owner-rpm test http probe-count 5
set services rpm probe test-owner-rpm test http probe-interval 60
set services rpm probe test-owner-rpm test http test-interval 60
set services rpm probe test-owner-rpm test http history-size 10
set services rpm probe test-owner-rpm test http thresholds successive-loss 2
set services rpm probe test-owner-rpm test http thresholds total-loss 3
set services rpm probe test-owner-rpm test http traps test-failure
set services rpm probe dns-rpm test ping probe-type icmp-ping
set services rpm probe dns-rpm test ping target address 8.8.8.8
set services rpm probe dns-rpm test ping probe-count 5
set services rpm probe dns-rpm test ping probe-interval 60
set services rpm probe dns-rpm test ping test-interval 60
set services rpm probe dns-rpm test ping history-size 10
set services rpm probe dns-rpm test ping thresholds successive-loss 2
set services rpm probe dns-rpm test ping thresholds total-loss 3
set services rpm probe dns-rpm test ping traps test-failure

vagrant@vsrx> ping 8.8.8.8   <-- can we ping the interwebs?
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=63 time=111.209 ms
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 111.209/111.209/111.209/0.000 ms

vagrant@vsrx> show services rpm history-results <-- let's check some results
 Owner, Test Probe received Round trip time
 dns-rpm, ping Fri Jul 15 15:51:01 2016 100218 usec
 dns-rpm, ping Fri Jul 15 15:52:01 2016 99053 usec
 dns-rpm, ping Fri Jul 15 15:53:01 2016 101603 usec
 dns-rpm, ping Fri Jul 15 15:54:01 2016 110142 usec
 dns-rpm, ping Fri Jul 15 15:55:01 2016 101889 usec
 dns-rpm, ping Fri Jul 15 15:56:02 2016 100131 usec
 dns-rpm, ping Fri Jul 15 15:57:02 2016 110226 usec
 dns-rpm, ping Fri Jul 15 15:58:02 2016 100236 usec
 dns-rpm, ping Fri Jul 15 15:59:02 2016 100083 usec
 dns-rpm, ping Fri Jul 15 16:00:02 2016 101544 usec
 test-owner-rpm, http Fri Jul 15 15:49:01 2016 250198 usec
 test-owner-rpm, http Fri Jul 15 15:50:01 2016 268652 usec
 test-owner-rpm, http Fri Jul 15 15:52:01 2016 269524 usec
 test-owner-rpm, http Fri Jul 15 15:53:01 2016 250041 usec
 test-owner-rpm, http Fri Jul 15 15:54:01 2016 260565 usec
 test-owner-rpm, http Fri Jul 15 15:55:01 2016 270286 usec
 test-owner-rpm, http Fri Jul 15 15:56:01 2016 260264 usec
 test-owner-rpm, http Fri Jul 15 15:58:01 2016 319523 usec
 test-owner-rpm, http Fri Jul 15 15:59:01 2016 260195 usec
 test-owner-rpm, http Fri Jul 15 16:00:01 2016 251111 usec
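
If you only have the CLI text to work with, the history table is easy to pick apart. The following is a minimal sketch and not the parsing done by netconf-poll.py (which talks NETCONF and gets structured data back); the regular expression and the parse_history name are my own assumptions based on the output format shown above.

```python
import re
from datetime import datetime

# owner, test, a timestamp such as 'Fri Jul 15 15:51:01 2016', RTT in usec
HISTORY_LINE = re.compile(
    r"^\s*(?P<owner>[\w-]+),\s+(?P<test>[\w-]+)\s+"
    r"(?P<when>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"(?P<rtt>\d+) usec\s*$"
)


def parse_history(text):
    """Yield (owner, test, datetime, rtt_usec) for each result line."""
    for line in text.splitlines():
        match = HISTORY_LINE.match(line)
        if not match:
            continue  # header row, blank line, or anything unexpected
        # Collapse repeated spaces (single-digit days) before parsing.
        stamp = " ".join(match.group("when").split())
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        yield (match.group("owner"), match.group("test"),
               when, int(match.group("rtt")))


if __name__ == "__main__":
    sample = (
        " Owner, Test Probe received Round trip time\n"
        " dns-rpm, ping Fri Jul 15 15:51:01 2016 100218 usec\n"
        " test-owner-rpm, http Fri Jul 15 15:49:01 2016 250198 usec\n"
    )
    for owner, test, when, rtt in parse_history(sample):
        print(owner, test, when.isoformat(), rtt / 1000.0, "ms")
```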

Ok. Let’s summarize where we’re at.

  • I can reach my Grafana/Influx server on my vboxnet0 virtual network.
  • I can reach the internet as validated by ping
  • My configuration has successfully been provisioned by Ansible to my vSRX
  • RPM is busy polling every minute and logging this data to a history table

Let’s see if our other services are working as expected.

If we’ve done everything right, port 8083 on our localhost should be forwarded to our Ubuntu server, where the InfluxDB admin interface should be listening. Let’s check.

Screen Shot 2016-07-15 at 4.08.20 PM

This looks good so far. How about Grafana?

Screen Shot 2016-07-15 at 4.10.04 PM

Also looks good.

Now we just need some data to look at, so let’s go and collect some. For that we’ll need to kick off our netconf-poll.py script to log into the vSRX, grab our RPM data and insert it into our time series database. It’s worth noting the data flow here, and this is intentional.

  1. My Python collection script is to be run locally from my MacBook. This is typically where I’d run my scripts from if I were polling some other network device.
  2. The server is accessible via vboxnet0 (which is the default in most VirtualBox installs I’ve seen), and we’ll be collecting the RPM data and posting it via REST to InfluxDB hosted on the Ubuntu server. This is simply another attached network to your laptop.
  3. The vSRX is also accessible via vboxnet0. It’s polling the internet via one interface (ge-0/0/0, which is bridged to your external NIC), and we’re attaching to it on the internal NIC and pulling off RPM data on its second (ge-0/0/1) interface.

You can confirm all this as well from the Virtualbox GUI as noted here.

Screen Shot 2016-07-15 at 4.18.37 PM.png

Let’s kick off our script. Since this is a demo, we’re hardcoded to attach to 192.168.56.107, but this could easily be modified to reflect whatever box you want to hit, or you could load up argparse or similar to pass those through to the script directly.

barnesry-mbp:Junos_RPM barnesry$ ./netconf-poll.py
2016-07-15 16:24:47,282 - INFO - Connected (version 2.0, client OpenSSH_6.6)
2016-07-15 16:24:48,373 - INFO - Authentication (publickey) failed.
2016-07-15 16:24:48,422 - INFO - Authentication (password) successful!
2016-07-15 16:24:48,860 - INFO - initialized: session-id=1242 | server_capabilities=['http://xml.juniper.net/dmi/system/1.0', 'urn:ietf:params:xml:ns:netconf:capability:confirmed-commit:1.0', 'http://xml.juniper.net/netconf/junos/1.0', 'urn:ietf:params:xml:ns:netconf:capability:validate:1.0', 'urn:ietf:params:xml:ns:netconf:capability:candidate:1.0', 'urn:ietf:params:xml:ns:netconf:capability:url:1.0?protocol=http,ftp,file', 'urn:ietf:params:xml:ns:netconf:base:1.0']
2016-07-15 16:24:48,869 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,055 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,178 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,306 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,434 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,616 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,740 - INFO - Requesting 'ExecuteRpc'
2016-07-15 16:24:49,865 - INFO - Connected to vsrx.thelab.net, FIREFLY-PERIMETER running 12.1X47-D15.4
2016-07-15 16:24:49,866 - INFO - Connected to InfluxDB
2016-07-15 16:24:49,880 - INFO - Starting new HTTP connection (1): localhost
2016-07-15 16:24:49,888 - INFO - Staring metrics collection...
2016-07-15 16:24:49,888 - INFO - Requesting 'ExecuteRpc'
<type 'list'>
2016-07-15 16:24:50,017 - WARNING - Element rtt not returned
2016-07-15 16:24:50,019 - INFO - Resetting dropped connection: localhost
2016-07-15 16:24:50,044 - WARNING - Element rtt not returned
2016-07-15 16:24:50,047 - WARNING - Element rtt not returned
2016-07-15 16:24:50,049 - WARNING - Element rtt not returned
2016-07-15 16:24:50,051 - WARNING - Element rtt not returned
2016-07-15 16:24:50,053 - WARNING - Element rtt not returned
2016-07-15 16:24:50,056 - INFO - Sleeping for 600 seconds

Screen Shot 2016-07-15 at 4.25.40 PM

So, did this work? Let’s check. You’ll notice if you refresh your Influx screen where previously _internal was the only database available, there should now be one called network. This was created on first run by the script.

First, select the network database in the upper right corner, then let’s check which measurements were inserted.

Screen Shot 2016-07-15 at 4.29.18 PM

OK, so we have something here. Let’s dig a little further.

Screen Shot 2016-07-15 at 4.30.41 PM

Excellent, we have data. Now we need to tell Grafana where to pull its metrics from so it can graph them, so we’ll point it at InfluxDB. I was hoping to automate all this manual config, but never quite made it there…TBD 🙂

Screen Shot 2016-07-15 at 4.33.21 PM

The default Grafana login is admin/admin if you’re wondering. Once logged in, let’s navigate to Data Sources/Add Data Source.

Screen Shot 2016-07-15 at 4.38.01 PM

You can plug in data similar to this screenshot, using the default root:root auth for Influx (don’t do this in production!). Then Save & Test.

 

Hopefully when you saved, it was successful. We can now navigate back to our main dashboard page, and create a New dashboard. For Panel Data Source select ‘rpm_data’. If you don’t see this option, Grafana may not have made a successful connection to InfluxDB, so there might be some additional troubleshooting to do there.

Screen Shot 2016-07-15 at 4.41.39 PM

This should open up a SELECT area where you can enter in your queries. I fought with this for a bit, but there’s a small button on the right side for “Toggle Edit Mode” where you can paste your query right in rather than fighting with the field/form entry. I’ve included the queries here for easy copy/paste.

 

SELECT value/1000 FROM "rpm_history" WHERE owner='test-owner-rpm'

SELECT value/1000 FROM "rpm_history" WHERE owner='dns-rpm'
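
The same queries can be checked outside Grafana through InfluxDB’s HTTP /query endpoint. A minimal sketch, assuming the forwarded port 8086 on localhost and the network database created by the script; the build_query_url helper is my own invention, and it only builds the URL, so nothing is sent unless you uncomment the last line.

```python
from urllib.parse import urlencode


def build_query_url(owner, host="localhost", port=8086, db="network"):
    """Return the /query URL for one owner's RTT series in milliseconds."""
    query = ('SELECT value/1000 FROM "rpm_history" '
             "WHERE owner='{}'".format(owner))
    return "http://{}:{}/query?{}".format(
        host, port, urlencode({"db": db, "q": query}))


if __name__ == "__main__":
    url = build_query_url("dns-rpm")
    print(url)
    # import urllib.request; print(urllib.request.urlopen(url).read())
```

Note the straight quotes in the query: double quotes around the measurement name and single quotes around the tag value. Curly quotes picked up from a web page will not parse.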

Screen Shot 2016-07-15 at 4.44.28 PM

You should end up with something that looks similar to this. Make sure you click SAVE at the top, or you’ll lose your changes. There’s also a typo in my screenshot w/r/t owner – it needs to match the data in the DB, and I had changed something somewhere…

Screen Shot 2016-07-15 at 4.45.43 PM

View

I had to adjust my time (zoom out) to find my data points as my times aren’t sync’d between my router/server/database but there they are. I can now track my HTTP/ICMP RTT times in a nice graphical interface.

Screen Shot 2016-07-15 at 4.48.23 PM

This just begins to scratch the surface of what’s possible, but it’s a good way to get the juices flowing when it comes to the types of metrics you’d like to get from your network.

 


Where exactly are Pokemon Go servers hosted?

I’ll admit. I don’t really follow video games that closely. Mostly owing to the fact that I’m frankly too busy with other things to fill my time with them these days. It’s not that I shun them for any particular reason. I used to be a fairly big gamer, but those days are long gone.

So then, I was surprised this morning by a bunch of articles on the (un)availability of the services tasked with keeping hordes of Pokemon Go players happy this past week.

http://www.gamespot.com/articles/pokemon-gos-international-rollout-paused-as-server/1100-6441650/

I have a 6yr old daughter who is infatuated with Pokemon these days thanks to Netflix, but I honestly don’t have a clue what they are, or why anyone would like them. I’ll let you come to your own opinions on the merits of the game, but it does have a pretty interesting history for us technical folks. To save you some reading, Niantic is a startup spun out of Google in 2015, birthed from John Hanke, father of Keyhole, which was acquired by Google and eventually became Google Earth.

During a work conversation today, a question came up that I thought for sure would be easy to answer. If Niantic’s services are being crushed by load, where are they hosted? I figured a quick Google search would unearth the answer to this query, but alas my Google-fu was not up to the task. So let’s dig in, shall we?

Pokemon Go has recently been released on Android and iOS. I’m an iOS user, so it’s not quite as easy as simply running a tcpdump on my laptop while running the app and finding the source (unless you’re jailbroken and whatnot). There are a myriad of ways to do this by the way, but I thought documenting one of the ways to grab this data might be interesting, as I learned a few things along the way as well, so why not share it.


First things first. Let’s grab a .pcap of the data coming off my iOS device so we can analyse our outbound traffic and find out where it’s headed.

A quick check with tcpdump on the en0 interface (my wireless NIC) was turning up only broadcast/multicast traffic, not the unicast traffic I expected to see in flight in the air. After all, wireless IS shared media, right?

barnesry-mbp:~ barnesry$ ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
 ether 60:f8:1d:c4:3b:38
 inet6 fe80::62f8:1dff:fec4:3b38%en0 prefixlen 64 scopeid 0x4
 inet 10.0.1.10 netmask 0xffffff00 broadcast 10.0.1.255
 nd6 options=1<PERFORMNUD>
 media: autoselect
 status: active

barnesry-mbp:$ sudo tcpdump -i en0 host 10.0.1.3
Password:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type EN10MB (Ethernet), capture size 65535 bytes
23:16:16.338903 IP 10.0.1.3.mdns > 224.0.0.251.mdns: 0 [2q] [1au] PTR (QU)? _airplay._tcp.local. PTR (QU)? _raop._tcp.local. (78)
23:16:16.645533 ARP, Request who-has 10.0.1.3 tell 10.0.1.13, length 28

What then to do?

If you want to sniff all the wireless traffic around, you’ll need to drop your wireless card into ‘monitor’ mode. This is also outlined in the tcpdump man page in OSX and will result in passing up everything from the 802.11 layer including wireless SNR, etc but makes the capture much more interesting!

-I Put the interface in “monitor mode”; this is supported only on IEEE 802.11
Wi-Fi interfaces, and supported only on some operating systems.

barnesry-mbp:~ barnesry$ sudo tcpdump -I -i en0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type IEEE802_11_RADIO (802.11 plus radiotap header), capture size 65535 bytes
23:09:30.065941 4111060402us tsft 1.0 Mb/s 2452 MHz 11g -75dB signal -92dB noise antenna 0 Beacon (CenturyLink6989) [1.0* 2.0* 5.5* 11.0* 18.0 24.0 36.0 54.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.075347 4111071507us tsft 1.0 Mb/s 2452 MHz 11g -92dB noise antenna 0
23:09:30.075625 4111071935us tsft 1.0 Mb/s 2452 MHz 11g -92dB noise antenna 0 Acknowledgment RA:b8:c7:5d:12:8e:f5 (oui Unknown)
23:09:30.112349 4111106996us tsft 1.0 Mb/s 2452 MHz 11g -92dB noise antenna 0 Beacon (Goiter) [1.0* 2.0* 5.5* 11.0* 6.0 9.0 12.0 18.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.168411 4111162822us tsft 1.0 Mb/s 2452 MHz 11g -76dB signal -92dB noise antenna 0 Beacon (CenturyLink6989) [1.0* 2.0* 5.5* 11.0* 18.0 24.0 36.0 54.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.214709 4111209396us tsft 1.0 Mb/s 2452 MHz 11g -38dB signal -92dB noise antenna 0 Beacon (Goiter) [1.0* 2.0* 5.5* 11.0* 6.0 9.0 12.0 18.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.270764 4111265204us tsft 1.0 Mb/s 2452 MHz 11g -73dB signal -92dB noise antenna 0 Beacon (CenturyLink6989) [1.0* 2.0* 5.5* 11.0* 18.0 24.0 36.0 54.0 Mbit] ESS CH: 9, PRIVACY
23:09:30.284515 4111280917us tsft short preamble 24.0 Mb/s 2452 MHz 11g -41dB signal -92dB noise antenna 0 Clear-To-Send RA:60:f8:1d:c4:3b:38 (oui Unknown)

What I also didn’t know is there’s a built in wireless diagnostic tool in OSX which can help with this and the process is documented HERE. This will result in a .wpcap file dropped on your desktop containing all wireless traffic snorted off your wifi card. Cool.

I closed all the applications on my phone, and turned it off prior to starting the capture. We will need the initial handshake captured when the phone associates to the AP in order to decode the traffic later.

So we start the wireless capture (and my timer) and turn the phone on.

Screen Shot 2016-07-08 at 11.33.29 PM

Within about 30 seconds, the phone is on. It’s likely there will be a bunch of stuff happening on boot, so we’ll wait a couple minutes for everything to settle down before launching anything, which will make triangulation of our game traffic much easier. Then we’ll log in, move around a bit, then close down the capture.

 

Let’s load that resulting .wpcap into Wireshark for a look.

This is interesting… we’ve got data, but we can’t read much of it. That’s because it’s encrypted with WPA2-PSK (PSK=pre-shared-key) or WPA2-Personal if you’re my Airport Extreme AP. So we’ll need to decrypt the data to get anything useful here. We’ll do that in a bit.

Screen Shot 2016-07-08 at 11.37.49 PM

First, let’s narrow our search down a bit, we’ll need the MAC address of my iPhone. We can get this by navigating to Settings–>General–>About–>Wifi Address on the phone. In my case the MAC address ends in 14:18 which should be good enough to filter on.

Rather than pick through the capture looking for this, or using a display filter let’s use some of the built in tools Wireshark provides to help us.

Navigate to Statistics–>WLAN Traffic.

This will open a window similar to below. I filtered on Ch. 9 to find my AP (Goiter – I name my APs after human afflictions) which then displays the packet statistics by MAC address for my AP. Cool right? Remember the last two octets 14:18 from my phone? It’s responsible for 31.56% of the traffic on this AP.

Screen Shot 2016-07-08 at 11.44.30 PM

Now – to filter traffic to ONLY my iPhone, let’s apply a filter to the capture. Right click on the MAC, and choose “Apply As Filter”

Screen Shot 2016-07-08 at 11.48.06 PM

Now we’re only looking at filtered data specific to my phone, but now we need to decode it. You can do this as outlined in the Wireshark Wiki (since you already know your WPA passphrase, right?)

In my case, I chose Edit–>Preferences, then selected Protocols and browsed down to IEEE 802.11. From here, ensure Enable Decryption is checked, and Edit your keys.

Screen Shot 2016-07-08 at 11.52.04 PM

Here I used wpa-pwd and simply entered my SSID passphrase in cleartext (I erased it for this screenshot).

Alternatively, I believe you can also select wpa-psk but you’ll need to enter in the full 64 digit hex key, as outlined here. The cleartext version worked for me so I didn’t bother with the latter.
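
For reference, that 64-digit hex key is derived from the passphrase and the SSID with PBKDF2 (HMAC-SHA1, 4096 iterations, 256-bit output). A minimal sketch using only the Python standard library; the wpa_psk helper name is mine, and the SSID and passphrase below are placeholders.

```python
import hashlib


def wpa_psk(passphrase, ssid):
    """Derive the 256-bit WPA/WPA2 pre-shared key as 64 hex digits."""
    key = hashlib.pbkdf2_hmac("sha1",
                              passphrase.encode("utf-8"),
                              ssid.encode("utf-8"),
                              4096,  # iteration count fixed by the standard
                              32)    # 32 bytes = 256 bits
    return key.hex()


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(wpa_psk("correct horse battery", "ExampleSSID"))
```

Because the SSID acts as the salt, the same passphrase on a differently named network produces a different key, which is why the wpa-pwd entry needs both values.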

Now, we’ve got a full decrypted traffic stream we can work with. You’ll also notice in the bottom status bar I’ve captured 350k odd packets, and I’m displaying only 70k of them (my iPhone), so let’s chuck the rest as we don’t need to work with such a big file. Choose File–>Export Specified Packets, and write our file out containing only the Displayed packets we’re interested in. Then re-open our smaller (more specific) file.

Analysis

Time for some analysis. Choose Statistics–>Conversations. This will net us a table of each pair of IP endpoints, so we can get a better view of how much data was flowing between pairs of machines during the capture. I found sorting by Rel Start to be useful, as I could match the appearance of certain IP addresses against my testing timeline. (Remember I waited for 120 or so seconds to start my test?)

Screen Shot 2016-07-09 at 12.07.22 AM

Excellent. There’s a notable gap in activity between 75s and 171s, so let’s start there and assume we probably kicked things off around 171 seconds into the capture.
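If you export the Conversations table, spotting that quiet period doesn't have to be done by eye. This is a small Python sketch that finds the widest pause between consecutive Rel Start values; apart from 75 and 171, the sample numbers are made up for illustration.

```python
def largest_gap(rel_starts):
    """Return (gap_start, gap_end) for the widest pause between start times."""
    ordered = sorted(rel_starts)
    if len(ordered) < 2:
        raise ValueError("need at least two start times")
    # Pair each start time with the next one and keep the widest pair.
    pairs = zip(ordered, ordered[1:])
    return max(pairs, key=lambda pair: pair[1] - pair[0])


if __name__ == "__main__":
    # Illustrative Rel Start values (seconds); only 75 and 171 come from the post.
    starts = [3.2, 41.7, 75.0, 171.0, 180.4, 209.1]
    print(largest_gap(starts))
```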

For a high level view of what’s going on, DNS is a good place to start, so let’s filter for that using the display filter dns.

 

Screen Shot 2016-07-09 at 12.18.28 AM

Some interesting stuff to glean here. It looks like our first call is at 171s into the capture as well, attempting to resolve pgorelease.nianticlabs.com. Bingo.

This very likely confirms our relative start time. There are some other queries for upsight and kontagent, which a quick Google search reveals are mobile analytics companies, likely in play here providing data back to the publisher. Next we have accounts.google.com, which is also explainable as I logged in using my Google account. The interesting stuff ends around 209s when I snapped a picture on my phone, and it immediately went to upload it to iCloud. More on that in a bit. 🙂

Here’s the key IP pairs (I think) we should be interested in.

  1. storage.googleapis.com -> which resolves to 216.58.217.48 (whois: 1e100.net, aka Google CDN)
  2. cl3.apple.com.edgekey.net -> which CNAMEs to the Akamai CDN

Screen Shot 2016-07-09 at 12.27.39 AM

Let’s see if our traffic profiles confirm some of this. Navigate to Statistics–>Conversations. I’ve selected IPv4 here, and sorted by Rel Start which should allow us to rule out any traffic occurring before relative timestamp 171sec.

Screen Shot 2016-07-09 at 12.31.06 AM

Since our game should be pretty chatty over time as we load the game, report location, and receive game status, I’d expect a high packet count but perhaps a low byte count (small packets, but lots of them). We’ve actually got four good candidates here.

  1. 216.58.217.48 – google 1e100.net (Denver)
  2. 216.58.216.129 – google 1e100.net (Seattle)
  3. 130.211.188.132 – googleusercontent.com
  4. 104.6.203.30 – Akamai via cl3.apple.com.edgekey.net
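The selection logic I applied by eye (started after 171s, lots of packets, small average packet size) can be sketched in Python as below. The Conversation class, the thresholds, and the sample rows are all hypothetical; the sample uses documentation-range addresses rather than values from my capture.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    peer: str
    packets: int
    total_bytes: int
    rel_start: float  # seconds since the start of the capture


def chatty_candidates(convs, start_after=171.0, min_packets=100, max_avg_bytes=400):
    """Pick conversations that start after the test began and look 'chatty'."""
    picked = [
        c for c in convs
        if c.rel_start >= start_after
        and c.packets >= min_packets
        and c.total_bytes / c.packets <= max_avg_bytes  # small packets on average
    ]
    # Busiest conversations first.
    return sorted(picked, key=lambda c: c.packets, reverse=True)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = [
        Conversation("192.0.2.10", 400, 60_000, 180.0),
        Conversation("192.0.2.20", 900, 1_200_000, 225.6),
        Conversation("198.51.100.5", 300, 45_000, 60.0),
    ]
    for conv in chatty_candidates(sample):
        print(conv.peer, conv.packets)
```

The thresholds are arbitrary starting points; tune them to whatever "chatty" looks like in your own capture.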

 

To resolve IP addresses to owners as I’ve done above, we’ll need to perform some WhoIs lookups to see who owns some of this IP space.

For this I personally like to use network-tools.com, which has a good suite of tools available in a consumable format. You can also use your trusty command line tools to accomplish this. The example below is for an IP address belonging to kontagent.net, who apparently host with SoftLayer.

barnesry-mbp:~ barnesry$ whois -h whois.radb.net 50.23.68.199
route: 50.23.64.0/18
descr: customer Alestra
origin: AS32098
mnt-by: MAINT-AS32098
changed: mmg@transtelco.net 20150618 #18:44:54Z
source: RADB

route: 50.23.64.0/18
descr: auto-generated route object for 50.23.64.0/18
origin: AS11172
mnt-by: MAINT-AS11172
changed: om_core@alestra.com.mx 20141031 #09:44:53Z
source: RADB

route: 50.23.64.0/18
descr: SOFTLAYER-sjc01
origin: AS36351
notify: noc@softlayer.com
mnt-by: MAINT-AS36351
changed: ipadmin@softlayer.com 20110104
source: RADB

route: 50.23.64.0/18
descr: REACH (Customer Route)
tech-c: RRNOC1-REACH
origin: AS36351
remarks: This auto-generated route object was created
remarks: for a REACH customer route
remarks:
remarks: This route object was created because
remarks: some REACH peers filter based on these objects
remarks: and this route may be rejected
remarks: if this object is not created.
remarks:
remarks: Please contact irr@team.telstra.com if you have any
remarks: questions regarding this object.
notify: irr@team.telstra.com
mnt-by: MAINT-REACH-NOC
changed: irr@team.telstra.com 20140206
source: REACH
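When a lookup returns several route objects like this, it can help to boil them down to origin AS and description. The following is a rough Python sketch that assumes the simple "key: value" layout with blank-line-separated objects seen above; the function names are mine and it is not a full RPSL parser.

```python
def parse_route_objects(whois_text):
    """Split RADB-style whois output into one dict per blank-line-separated object."""
    objects, current = [], {}
    for line in whois_text.splitlines():
        if not line.strip():
            # A blank line ends the current object.
            if current:
                objects.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        # Keep only the first value for repeated keys such as "remarks".
        current.setdefault(key.strip(), value.strip())
    if current:
        objects.append(current)
    return objects


def origins(whois_text):
    """Return (origin AS, description) for every route object found."""
    return [
        (obj.get("origin"), obj.get("descr"))
        for obj in parse_route_objects(whois_text)
        if "route" in obj
    ]
```

Run against the output above, this gives one (origin, descr) pair per route object, which makes the SoftLayer entry easy to pick out from the others.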

Since Niantic spun out of Google in 2015, it’s a pretty good guess they’re hosted in Google Compute Engine. Their two IPs listed above account for ~700 packets, or about half of the traffic from the big four listed earlier.

I think we can further discount the IP address 104.6.203.30 as some background Apple stuff hosted in Akamai perhaps related to game launch (validation and such). This leaves… Google Compute Engine.

We can also rule out the large byte transfer at Rel Start 225.647351, as a deep dive on that lines up with our DNS query for an Amazon AWS Oregon S3 bucket, preceded by some chatter to Apple CDN servers. This is further confirmed above by a whois on 54.231.163.58 (AWS-S3) and also by the upload ratio for this conversation… stuff going up, up, up into the cloud when I snapped my picture on my iPhone.

Screen Shot 2016-07-09 at 12.37.22 AM

Summary

The intent of this post was initially to answer my earlier question of where Niantic was hosting Pokemon Go, which could perhaps have been more easily assumed simply by looking at where Niantic was birthed (Google), but where is the fun in that?

The end result is less a proof point of this fact, but more a view into various troubleshooting tools and methodologies available to really deep dive into various network traffic patterns on the internet.

If you’ve gotten this far, hopefully it was an interesting enough read.

Troubleshooting the Juniper SRX jdhcpd

There is a new game in town when it comes to configuring your SRX to provide DHCP addresses.

The new method of configuration is using a new daemon called jdhcpd which is outlined in the following Juniper KB article.

Fair enough. I moved my configuration from the old method, to the new method to allow DHCP scopes to exist in routing-instances.

And yet – when I apply that configuration…my DHCP daemon doesn’t seem to be running.

Odd, right? I checked and rechecked my configuration, rebooted, performed a “commit full” and still had the same results. A quick trawl of internet posts also yielded nothing significant, just a few other posts with no answers, which means either I’m quickly getting denser in my old age, or this behavior can be explained by a bug.

system {
  dhcp-local-server {
    group WIFI {
      interface fe-0/0/5.0;
    }
  }
}
access {
  address-assignment {
    pool WIFI-PUBLIC {
      family inet {
        network 172.29.0.0/24;
        range WIFI-PUBLIC-POOL {
          low 172.29.0.10;
          high 172.29.0.100;
        }
      }
    }
  }
}

root@SRX100-H2_Branch_1# run show version
Hostname: SRX100-H2_Branch_1
Model: srx100h2
JUNOS Software Release [12.1X46-D40.2]

root@SRX100-H2_Branch_1# commit full
Feb 12 06:31:32 init: can not access /usr/sbin/hostname-cached: No such file or directory
Feb 12 06:31:32 init: hostname-caching-process (PID 0) started
Feb 12 06:31:32 init: security-intelligence (PID 19821) started
Feb 12 06:31:32 init: can not access /usr/sbin/ipmid: No such file or directory
Feb 12 06:31:32 init: ipmi (PID 0) started
Feb 12 06:31:32 init: dhcp-service (PID 18286) exited with status=1 <-THIS IS BAD
Feb 12 06:31:32 init: dhcp-service (PID 19826) started
commit complete

Feb 12 06:30:42 SRX100-H2_Branch_1 mgd[2768]: UI_CHILD_START: Starting child '/usr/sbin/dhcpd'
Feb 12 06:30:42 SRX100-H2_Branch_1 mgd[2768]: UI_CHILD_STATUS: Cleanup child '/usr/sbin/dhcpd', PID 19436, status 0
Feb 12 06:30:44 SRX100-H2_Branch_1 mgd[2768]: UI_CHILD_START: Starting child '/usr/sbin/jdhcpd'
Feb 12 06:30:46 SRX100-H2_Branch_1 mgd[2768]: UI_CHILD_STATUS: Cleanup child '/usr/sbin/jdhcpd', PID 19441, status 0
Feb 12 06:31:32 SRX100-H2_Branch_1 /kernel: init: dhcp-service (PID 18286) exited with status=1
Feb 12 06:31:32 SRX100-H2_Branch_1 /kernel: init: dhcp-service (PID 19826) started
Feb 12 06:31:36 SRX100-H2_Branch_1 /kernel: setsockopt(RTS_ASYNC_NEED_RESYNC) ignored (dhcpd): client already active
Feb 12 06:31:31 SRX100-H2_Branch_1 jdhcpd: DH_SVC_UDP_SOCKET_EXISTS_FAILURE: UDP socket already established

root@SRX100-H2_Branch_1# run show system processes | match dhcp
19816 ?? S 0:00.38 /usr/sbin/dhcpd -N  <-- this should be jdhcpd!!!
19428 p0 S+ 0:00.07 egrep -i dhcp (grep)

root@SRX100-H2_Branch_1# run show dhcp server statistics
error: the dhcp-service subsystem is not running  <- sad panda

 

 

So I tried dumping just this bit of config on a fresh SRX, and … it works! So there must be something in my existing configuration that’s causing this. But what is it?

By pasting blocks of config sequentially, and checking my running daemons I was able to narrow down what worked, and what didn’t. Then it was a matter of adding/removing bits of the config until I determined exactly which part was the culprit.

It turns out that the autoinstallation process relies on the dhcpd daemon (I’m surmising) to get its assigned address on boot, except that process isn’t compatible with the *new* way of configuring DHCP. Sigh.

 

As soon as I deactivate this part of the config, DHCP works, as it’s no longer trying to start both DHCP daemons. I suspect that because autoinstallation is the first stanza in the config file, the legacy dhcpd daemon starts first, followed by the jdhcpd daemon, which can then never bind to its socket.

root@SRX100-H2_Branch_1# show system autoinstallation
##
## inactive: system autoinstallation
##
usb {
 disable;
}
root@SRX100-H2_Branch_1# commit
commit complete
root@SRX100-H2_Branch_1# run show system processes | match dhcp
18286 ?? S 0:02.07 /usr/sbin/jdhcpd -N <-- Success!!
root@SRX100-H2_Branch_1# run show dhcp server binding
IP address Session Id Hardware address Expires State Interface
172.29.0.11 4 08:00:27:df:75:5f 81431 BOUND fe-0/0/5.0
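If you want to script this check across a few boxes, the test boils down to which daemon shows up in the process listing. Here is a hedged Python sketch that classifies the text output of "show system processes | match dhcp"; the function names are mine, and how you collect the output (SSH, NETCONF, copy and paste) is left to you.

```python
def dhcp_daemons(process_listing):
    """Return the set of DHCP daemon names found in a process listing."""
    found = set()
    for line in process_listing.splitlines():
        for token in line.split():
            # Only consider absolute paths such as /usr/sbin/jdhcpd.
            if token.startswith("/"):
                name = token.rsplit("/", 1)[-1]
                if name in ("dhcpd", "jdhcpd"):
                    found.add(name)
    return found


def diagnose(process_listing):
    """Summarise which DHCP daemon is running."""
    daemons = dhcp_daemons(process_listing)
    if daemons == {"jdhcpd"}:
        return "ok: jdhcpd is running"
    if "dhcpd" in daemons:
        return "conflict: legacy dhcpd is running"
    return "no DHCP daemon found"
```

Feeding it the broken listing from earlier in this post returns the conflict message, while the listing above returns ok.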

 

Much better. Since I didn’t find any reference to this while exercising my Google-fu, hopefully this ends up helping someone else.

Generating Custom Juniper Syslog Messages

I wanted to focus on a lesser known feature which I’ve found useful over the years when trying to setup NMS alerting and logging which doesn’t typically garner much attention in the documentation.

This would be the logger utility available on any JunOS platform from the shell. This program allows you to generate virtually any syslog event at will to test your configurations and confirm you have the correct hosts, priority, facility and filtering set on your local host as well as confirm your alerting is working correctly w/o having to flap interfaces, or other intrusive testing.

Here’s a quick overview on our available Severity and Facility definitions.

So let’s say I want to generate a syslog message to a remote syslog collector, but I only want to send messages which are generated by the external facility with a severity of info or better.

This process is very well outlined in the Day One Book : Applying Junos Event Automation

But to save the hunting, here’s the relevant excerpt:


 

How to use the logger test utility

1. The logger utility is a shell command, and so the user must first start a system shell by invoking the start shell command:

    user@Junos> start shell
    %

2. The logger utility has the following command syntax: logger -e EVENT_ID -p SYSLOG_PRIORITY -d DAEMON -a ATTRIBUTE=VALUE MESSAGE. Only the EVENT_ID is required, and it must be entered entirely in uppercase:

    % logger -e UI_COMMIT

The above command causes a UI_COMMIT event to be generated, originated from the logger daemon, with no attributes, no message, and a syslog facility/severity of user/notice.

The default settings can be altered by using one of the optional command line arguments.

3. For an alternate syslog facility/severity use the -p argument and specify the facility/severity in the same facility.severity format used by the jcs:syslog() function:

    % logger -e UI_COMMIT -p external.info

MORE? See Day One: Applying Junos Operations Automation for a table that lists the valid syslog facilities and severities for the jcs:syslog() function.

4. To alter what daemon generated the event, use the -d argument:

    % logger -e UI_COMMIT -d mgd

5. Include attributes for the event by using the -a argument. Use the argument multiple times if more than one attribute is needed. The attribute name must be in lowercase and should be followed by an equal sign and the desired value:

    % logger -e UI_COMMIT -a username=user -a command=commit

6. The syslog message follows all the command line arguments. Quotes are not required but are recommended for clarity:

    % logger -e UI_COMMIT -d mgd "This is a fake commit."

The above command causes the following message to be shown in the syslog:

 

    Jul 22 12:47:03 Junos mgd: UI_COMMIT: This is a fake commit.

NOTE When using the logger utility the event ID must always be in uppercase and the attribute names must always be in lowercase.
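If you end up generating a lot of test events, a tiny helper that enforces the uppercase/lowercase rules from the excerpt above can save some head scratching. This is a sketch in Python; the function name is mine, it only builds the command string (it does not run it), and the quoting assumes a POSIX-style shell.

```python
import shlex


def build_logger_command(event_id, message="", priority=None, daemon=None, attributes=None):
    """Build a Junos 'logger' command line from the options described above."""
    if not event_id or event_id != event_id.upper():
        raise ValueError("event ID must be entirely uppercase")
    parts = ["logger", "-e", event_id]
    if priority:
        parts += ["-p", priority]  # facility.severity, e.g. external.info
    if daemon:
        parts += ["-d", daemon]
    for name, value in (attributes or {}).items():
        if name != name.lower():
            raise ValueError("attribute names must be lowercase")
        parts += ["-a", f"{name}={value}"]
    if message:
        parts.append(message)
    # Quote each argument so the message survives as a single word.
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    print(build_logger_command("UI_COMMIT", "This is a fake commit.", daemon="mgd"))
```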


 

So let’s test this out.

 

root@vSRX-NAT-GW% logger -e LICENSE_VIOLATION -p external.info "This is a test of the emergency broadcast system"
root@vSRX-NAT-GW% exit
exit
root@vSRX-NAT-GW> show log messages | match LICENSE_VIOLATION | last 10
Jan 27 04:53:59 vSRX-NAT-GW logger: LICENSE_VIOLATION: This is a test of the emergency broadcast system

 

You can further confirm this by checking the syslog configuration on your local device. In this case, I’ve specified I’m ONLY sending messages to external host 192.168.56.11 if the facility is ‘external’ AND the severity is ‘info’ or greater (i.e. not debug) AND the message matches the regex LICENSE. Otherwise, we’ll likely have a local catch-all configured with any-any to locally log messages we may not be explicitly interested in looking at on the remote server.

root@vSRX-NAT-GW> show configuration
## Last commit: 2016-01-27 04:59:04 UTC by root
version 12.1X47-D20.7;
system {
  syslog {
    host 192.168.56.11 {
      external info;
      match LICENSE;
    }
    file messages {
      any any;
      authorization info;
    }
    file interactive-commands {
      interactive-commands any;
    }
  }
}

 

To check this is working correctly, we’ll throw a ‘monitor traffic’ on the management interface looking for outbound UDP traffic to my syslog host, and we’ll generate some messages.

root@vSRX-NAT-GW> monitor traffic interface ge-0/0/0 matching "host 192.168.56.11 and udp"
[...]
Use <no-resolve> to avoid reverse lookups on IP addresses.
^C
941 packets received by filter
root@vSRX-NAT-GW>

 

And here are the various tests we ran:

barnesry-mbp:python barnesry$ ssh root@192.168.56.107
Password:
--- JUNOS 12.1X47-D20.7 built 2015-03-03 21:53:50 UTC
root@vSRX-NAT-GW% logger -e LICENSE_VIOLATION -p daemon.warning "This is a test"
root@vSRX-NAT-GW% logger -e VIOLATION -p external.info "This is a test"
root@vSRX-NAT-GW% logger -e VIOLATION -p external.info "This is a test"

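You can see in the capture above that no packets left my local host as a result of any of these tests. The first used the wrong facility (daemon rather than external), and the other two used an event ID that doesn’t match the LICENSE regex.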

Now, let’s restart the packet capture again, and generate a message we’re pretty certain will match.

barnesry-mbp:python barnesry$ ssh root@192.168.56.107
Password:
--- JUNOS 12.1X47-D20.7 built 2015-03-03 21:53:50 UTC
root@vSRX-NAT-GW% logger -e LICENSE_VIOLATION -p external.info "This is a test"

 

Let’s check our packet capture again.

root@vSRX-NAT-GW> monitor traffic interface ge-0/0/0 matching "host 192.168.56.11 and udp"
[...]
05:09:17.723090 Out IP truncated-ip - 42 bytes missing! 192.168.56.107.syslog > 192.168.56.11.syslog: SYSLOG local2.info, length: 74
^C
121 packets received by filter
0 packets dropped by kernel
root@vSRX-NAT-GW>

Success!
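To reason about which test messages should reach the collector before touching the box, the three conditions on the host stanza (facility, minimum severity, regex) can be modelled in a few lines of Python. This is only my simplified mental model of the filter, not Junos code, and the names are my own.

```python
import re

# Standard syslog severities, most severe first.
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "info", "debug"]


def would_forward(facility, severity, message,
                  want_facility="external", min_severity="info", pattern="LICENSE"):
    """Return True if a message passes the facility, severity and regex checks."""
    if facility != want_facility:
        return False
    # A lower index means more severe; anything less severe than the minimum is dropped.
    if SEVERITIES.index(severity) > SEVERITIES.index(min_severity):
        return False
    return re.search(pattern, message) is not None


if __name__ == "__main__":
    tests = [
        ("daemon", "warning", "LICENSE_VIOLATION: This is a test"),
        ("external", "info", "VIOLATION: This is a test"),
        ("external", "info", "LICENSE_VIOLATION: This is a test"),
    ]
    for facility, severity, message in tests:
        print(facility, severity, would_forward(facility, severity, message))
```

Only the last of the three sample messages passes all of the checks, which lines up with the single syslog packet seen in the capture.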

Blowing up your Juniper QFX TCAM

So a rather ‘neat’ feature of the Juniper QFX3500 is its limited TCAM space for storing ACLs. More specifically, if your ACLs get too large to fit, they simply spill out and become inactive.

In this case – these are ingress ACLs applied to the lo0 interface as a control-plane filter and here’s how you tell if you’re running into problems.

user@router> request pfe execute command "show filter hw groups" target fpc0 | match iRACL
GOT: iRACL group id: 14. Entries: 512 Max Entries: 512 Pri: 1 Slice: 4 Def Entries: 0

Now – if you execute this command on your platform and the iRACL group is missing from the output, that may indicate the control-plane protection you’ve deployed is actually inactive. The output may differ between JunOS versions; this output was obtained on 12.2X50-D20.4.

You can see from the output above I’ve used 512 TCAM entries of an available 512, which is not ideal. If you want to find out how your configuration allocates TCAM, you can execute the following:

 

user@router> start shell
% vty fpc0
TOR platform (1200000000Mhz XLR processor, 89MB memory, 0KB flash
TFXPC0(vty)# show filter hw 1 show_term_info
======================
Filter index : 1
======================
- Filter name : CONTROLPLANE-IN
+ Hardware Instance : 1
+ Hardware key (struct brcm_dfw_hw_key_t):
- Type : IRACL
- Vlan id : 0
- Direction : ingress
- Protocol : 2 (IPv4)
- Port class id : 0
- Class id : 0
- Loopback : 1
- Port : 0(xe-17)
- Vlan tag : 0
+ FP usage info (struct brcm_dfw_fp_t):
- Group : IFP iRACL group (14)
- My Mac : 00:00:00:00:00:00
- Loopback Reference Count : 00000001
- IFL Type : unknown (0)
- List of tcam entries : [ total: 512; 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 
637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 ]
- List of ranges : [ total: 0; ]
- List of interface match entries : [ total: 0; ]
- List of dot1q-tag match entries : [ total: 0; ]
- List of l3 ifl index entries : [ total: 0; ]
- List of vfp tcam entries : [ total: 0; ]
+ Misc info (struct brcm_dfw_misc_info_t):
- List of <anlz_id, entry_id> : [ total: 0; ]
+ Bind point info (union brcm_dfw_bind_point_info_t):
+ Loopback : CPU Traffic
+ Programmed: YES
+ Total TCAM entries available: 256
+ Total TCAM entries needed : 512
+ Term Expansion:
- Term 1: will expand to 14 terms: Name "SSH"
- Term 2: will expand to 14 terms: Name "NETCONF-SSH"
- Term 3: will expand to 3 terms: Name "TACACS"
- Term 4: will expand to 10 terms: Name "NTP-SRC"
- Term 5: will expand to 10 terms: Name "NTP-DST"
- Term 6: will expand to 5 terms: Name "ICMP-ECHO"
- Term 7: will expand to 4 terms: Name "ICMP"
- Term 8: will expand to 6 terms: Name "TRACEROUTE-PORTS"
- Term 9: will expand to 10 terms: Name "SNMP"
- Term 10: will expand to 3 terms: Name "DNS"
- Term 11: will expand to 24 terms: Name "BGP-NEIGHBORS-SRC"
- Term 12: will expand to 24 terms: Name "BGP-NEIGHBORS-DST"
- Term 13: will expand to 1 term : Name "DENY-ALL"
+ Term TCAM entry requirements:
- Term 1: needs 56 TCAM entries: Name "SSH"
- Term 2: needs 56 TCAM entries: Name "NETCONF-SSH"
- Term 3: needs 12 TCAM entries: Name "TACACS"
- Term 4: needs 40 TCAM entries: Name "NTP-SRC"
- Term 5: needs 40 TCAM entries: Name "NTP-DST"
- Term 6: needs 20 TCAM entries: Name "ICMP-ECHO"
- Term 7: needs 16 TCAM entries: Name "ICMP"
- Term 8: needs 24 TCAM entries: Name "TRACEROUTE-PORTS"
- Term 9: needs 40 TCAM entries: Name "SNMP"
- Term 10: needs 12 TCAM entries: Name "DNS"
- Term 11: needs 96 TCAM entries: Name "BGP-NEIGHBORS-SRC"
- Term 12: needs 96 TCAM entries: Name "BGP-NEIGHBORS-DST"
- Term 13: needs 4 TCAM entries: Name "DENY-ALL"
+ Total TCAM entries available: 256
+ Total TCAM entries needed : 512
Total hardware instances: 1

 

Based on the output above, you’ll be able to determine which parts of your ACL can be pruned to fit within the memory constraints of the platform and adjust accordingly.
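To see at a glance which terms are the most expensive, you can pull the "needs N TCAM entries" lines out of that output and sort them. The Python sketch below assumes the line format shown above (which may differ between releases); the function name is mine.

```python
import re

# Matches lines such as: - Term 1: needs 56 TCAM entries: Name "SSH"
TERM_RE = re.compile(r'Term (\d+): needs (\d+) TCAM entries: Name "([^"]+)"')


def tcam_report(output):
    """Return (total_entries, [(term_name, entries), ...]) with the largest terms first."""
    terms = [(name, int(count)) for _num, count, name in TERM_RE.findall(output)]
    terms.sort(key=lambda term: term[1], reverse=True)
    total = sum(count for _name, count in terms)
    return total, terms


if __name__ == "__main__":
    sample = '''
    - Term 1: needs 56 TCAM entries: Name "SSH"
    - Term 3: needs 12 TCAM entries: Name "TACACS"
    - Term 11: needs 96 TCAM entries: Name "BGP-NEIGHBORS-SRC"
    '''
    total, terms = tcam_report(sample)
    print(total)
    for name, count in terms:
        print(f"{count:4d}  {name}")
```

In my output the two BGP-NEIGHBORS terms are the biggest consumers at 96 entries each, so that is where I would start trimming.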

 

Why running database applications over WAN connections is generally a bad idea

This is a problem I’ve run into more often than I’d like to admit during my regular course as a network administrator. You’ve probably heard some version of the following dialogue…

HelpDesk : Customer has called in saying the ‘network is slow’ and would like you to investigate to see why the network is causing his application issues.

Technician : I’ve checked all my traffic graphs, error counters, CPU stats, WAN links, etc and everything looks like it’s running fine. What exactly is slow?

HelpDesk : Well – the user is complaining that he has to wait for up to 5 minutes to retrieve his inventory query through <insert application here>. It never used to take that long…

Me : Has this been a progressive slowness, or did it take 3 seconds yesterday and 5 minutes today?

HelpDesk : I’m not sure…

To be fair, this complaint was more common when Microsoft Access was in heavier use, so before we all go ahead and blame the network for everything, let’s take a quick look at how common database applications work in practice. A short disclaimer before I get too far into this. I’m not technically a database guy, so there are very likely going to be some errors on the finer aspects of what I’m describing here. Feel free to correct me in the comments!

Let’s assume we have a database of 100,000 rows of data and the particular application I’m running is going to perform some sort of operation on them. If I have a poorly written query such that the client will go out and request all 100,000 rows of data, I have a problem. Each row required will be sent as a separate request, which works out to 100k requests that must be sent on the ‘wire’ to retrieve the information we’re looking for. This might work fine for you connected to a 1Gbps LAN, however let’s hit up the math to compare what introducing a higher latency WAN link will do to performance.

At 1Gbps, the RTT (round trip time) for a large packet is going to be in the sub-millisecond range. Let’s use 0.012ms for our example (see the calculation below), assuming there are no other impediments in the path to get this data, including disk I/O, etc.

0.012ms X 100,000 requests = 1200ms  OR 1.2 seconds

Let’s say our client puts a request in to work from home, and connects to the corporate network using a home DSL or Cable connection with a reasonable amount of bandwidth available. I would venture that most VPN’s would be hard pressed to get better than a 25ms RTT to the corporate network, so we’ll use that as an example. Remember, the longer the physical distance, the higher the latency. You can’t speed up light, and every device in the path between client and server will introduce some sort of forwarding delay during packet processing.

25ms X 100,000 requests = 2,500,000ms OR ~42 minutes!
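The arithmetic above is simply round trip time multiplied by the number of sequential requests. A quick Python sketch lets you plug in your own numbers; the function name is mine, and it assumes each request waits for the previous response before it is sent.

```python
def total_wait_seconds(rtt_ms, round_trips):
    """Total time spent waiting on the network for strictly sequential requests."""
    return rtt_ms * round_trips / 1000.0


if __name__ == "__main__":
    requests = 100_000
    for label, rtt_ms in [("1Gbps LAN", 0.012), ("VPN over DSL/cable", 25.0)]:
        seconds = total_wait_seconds(rtt_ms, requests)
        print(f"{label}: {seconds:.1f} s ({seconds / 60:.1f} min)")
```

Note that bandwidth does not appear anywhere in the formula, which is the whole point.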

This problem is closely related to what has been described in great detail as the bandwidth-delay product.

One thing to keep in mind with these types of complaints is that the general word that starts to float about the office when problems like this become serious is bandwidth. Since most managers can buy more bandwidth, this becomes the magic bullet that will solve all problems. You’ve heard the conversation before… “Well, how much is it going to cost us to upgrade our T1 to a DS3. Just get it done!”

Now, consider this. A database request consists of a very small request for data and, for smaller databases, also a small response for each row requested. So from a bandwidth perspective, if you were to monitor the overall BW requirements you may find them to be quite low. The root of the problem here is the number of requests that need to be transmitted (RTTs), which results in the poor response times. This phenomenon is what fuelled the popularity of Citrix and other thin clients back in the day, whereby the database and client application would run from the same local LAN in the office, and the user would simply look at the screenshots (thin client) of the query results. More on that later…

Fortunately, reducing latency in practice is much easier than accelerating light, since you can do it by accident, simply by increasing your bandwidth. The exception to this rule is increasing your BW without changing your physical interface. Specifically – if you subscribe to a partial T1 service whereby you are allocated a certain number of timeslots (let’s use 768k), simply requesting your provider to bump your line rate to the full T1 speed (1.544Mbps) will NOT improve your performance, simply because previously you were already serializing your data at T1 speeds, but limited to 768kbps of bandwidth. Increasing your bandwidth in this case will NOT decrease your serialization delay, and thus provides no performance improvement. More on serialization delay below!

Serialization delay is the amount of time it takes to break a packet down into its ones and zeros and send it on the wire. I may cover the details at a later date, but in the interim, check the article already published on it here. At slower link speeds, actually increasing the BW may help reduce latency, simply because it takes less time to serialize the data onto the wire. So you’ll end up with the same amount of data transiting your WAN link, but taking less time to do it. Referencing the article above, you can see the serialization delay is halved by upgrading your 128k serial connection to a 256k serial connection. Your performance will improve not necessarily because of more bandwidth, but because you’ve reduced your serialization delay.

Serialization Delay for 1500byte packets

128kbps = 94ms

256kbps = 47ms

1Mbps = 12ms

10Mbps = 1.2ms

100Mbps = 0.12ms

1Gbps = 0.012ms
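Serialization delay is just the packet size in bits divided by the link speed in bits per second. If you want to check other packet sizes or link speeds, here is a minimal Python sketch; the function name is mine, and it ignores framing overhead.

```python
def serialization_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


if __name__ == "__main__":
    links = [
        ("128kbps", 128_000),
        ("256kbps", 256_000),
        ("1Mbps", 1_000_000),
        ("10Mbps", 10_000_000),
        ("100Mbps", 100_000_000),
        ("1Gbps", 1_000_000_000),
    ]
    for label, bps in links:
        print(f"{label}: {serialization_delay_ms(1500, bps):.3f} ms")
```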

Another option that claims to solve this problem is a product like the Riverbed Steelhead or Cisco WAAS, which perform data object caching, compression, and optimization of small-packet, high-transaction applications. You need to be aware that anytime a device is analysing, replacing, or compressing your data in flight, there are going to be extra delays introduced. However, the optimization algorithms used in these appliances generally offset the extra latency by improving user performance enough that using them is still beneficial. In fact, there are a number of use cases where these appliances make stellar improvements in the user experience in remote offices.

Based on the above info, you now have a couple of solutions in your toolkit.

1. Increase your BW, but more importantly, reduce your latency.

2. Introduce a thin client solution to keep the client closer to the server such as Citrix.

3. Use an application acceleration engine such as Cisco WAAS or Riverbed Steelhead to perform compression and optimization of your remote queries.

So, the next time a co-worker or manager comes to you with a BW problem, you’ll be in a better position to explain exactly what you are recommending as a solution, and more importantly, why.