
Consistent software versions on dual-partition Junos devices

Modern Junos devices have dual boot partitions, each with its own copy of the operating system. This ensures that the device can still boot if storage or other boot-related issues are detected on the primary boot partition.

However, when you manually upgrade the software, only one partition is updated, and that partition is then set as the boot partition. The second, original partition keeps the previous version until it is manually updated as well. This can be an advantage when you are testing a new software version and want to roll back quickly. It can also wreak havoc if the device inadvertently falls back to an ancient software version that lacks, or is incompatible with, features you’ve since enabled. Once you have found a stable software version, it’s best to keep both partitions in sync. Here’s how to do it…

Identifying a software mismatch

The first hint is given at boot time. Right before the login prompt and banner, the console alerts you to the software version disparity.

kern.securelevel: -1 -> 1
Creating JAIL MFS partition...
JAIL MFS partition created
Boot media /dev/ad0 has dual root support


WARNING: JUNOS versions running on dual partitions are not same


** /dev/ad0s1a
FILE SYSTEM CLEAN; SKIPPING CHECKS
clean, 242154 free (34 frags, 30265 blocks, 0.0% fragmentation)

You can also find out from operational mode. The command below shows the software version on each of the two partitions.

netprobe@netfilter> show system snapshot media internal
Information for snapshot on       internal (/dev/ad0s1a) (backup)
Creation date: Jul 31 11:13:12 2014
JUNOS version on snapshot:
  junos  : 12.1X44-D35.5-domestic
Information for snapshot on       internal (/dev/ad0s2a) (primary)
Creation date: Mar 4 19:53:11 2016
JUNOS version on snapshot:
  junos  : 12.1X46-D40.2-domestic

As you can see from the output, we are currently running on partition /dev/ad0s2a, which has a newer software version than the first partition /dev/ad0s1a.
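As a side note, releases with dual-root support can also report directly which slice the device booted from. Availability and exact output vary by platform and release, so treat this as a pointer rather than a guarantee:

netprobe@netfilter> show system storage partitions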

Cloning the primary partition

To get the software version from the now-primary partition over to the backup, the system first formats the backup partition and then clones the contents onto it. The process is initiated with the command below.

netprobe@netfilter> request system snapshot slice alternate
Formatting alternate root (/dev/ad0s1a)...
Copying '/dev/ad0s2a' to '/dev/ad0s1a' .. (this may take a few minutes)
The following filesystems were archived: /

Depending on the model, you might need to reboot the device after the cloning process completes. If all went well, you will no longer see the version-mismatch warning:

Creating JAIL MFS partition...
JAIL MFS partition created
Boot media /dev/ad0 has dual root support
** /dev/ad0s1a
FILE SYSTEM CLEAN; SKIPPING CHECKS

And voilà, the snapshot command now shows the same software version on both partitions.

netprobe@netfilter> show system snapshot media internal
Information for snapshot on       internal (/dev/ad0s1a) (backup)
Creation date: Mar 4 21:45:37 2016
JUNOS version on snapshot:
  junos  : 12.1X46-D40.2-domestic
Information for snapshot on       internal (/dev/ad0s2a) (primary)
Creation date: Mar 4 21:48:51 2016
JUNOS version on snapshot:
  junos  : 12.1X46-D40.2-domestic

On EX switches, you can alternate between boot partitions by entering this command:

request system reboot slice alternate media internal

Unfortunately this doesn’t seem to work on SRX devices, at least not on the branch devices I’ve worked with so far. If anyone knows how to make this work on these SRXs I would love to hear about it!

Allowing inbound DHCP requests on a Cisco ZBFW

I came across an interesting one today: a Cisco Zone-Based Firewall needed to be reconfigured to serve DHCP for a segment connected to it in a zone called “Guest”. It already had a policy-map configured for traffic from Guest to Self, with ACLs for SSH management.
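For context, Guest-to-Self traffic on a ZBFW is governed by a zone-pair with an inspect policy-map attached. The existing setup looked roughly like this (the zone-pair name is my reconstruction; only the policy-map name is taken from the actual configuration shown further down):

zone-pair security Guest-Self source Guest destination self
 service-policy type inspect Guest-Self-PMap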

First I tried adding these two lines to the ACL referenced by the existing class-map:

 permit udp any any eq bootpc
 permit udp any any eq bootps

Although I did see the ACL match counters increment, DHCP still wasn’t handing out addresses. A quick search led me to this page on the Cisco site. In the last paragraph, they state the following:

If the routers’ inside interface is acting as a DHCP server and if the clients that connect to the inside interface are the DHCP clients, this DHCP traffic is allowed by default if there is no inside-to-self or self-to-inside zone policy.
However, if either of those policies does exist, you need to configure a pass action for the traffic of interest (UDP port 67 or UDP port 68) in the zone pair service policy.

In my case, there was a policy configured but with the action set to inspect. To fix it, I had to add a new ACL and class-map to the Guest-Self policy-map.

First, a new ACL that matches the DHCP traffic. The source and destination are set to any because DHCP requests are sent from 0.0.0.0 to the broadcast address 255.255.255.255:

ip access-list extended Guest-Self-DHCP-ACL
 permit udp any any eq bootpc
 permit udp any any eq bootps

Tie the ACL to a new inspect class-map:

class-map type inspect match-any Guest-Self-DHCP-CMap
 match access-group name Guest-Self-DHCP-ACL

And finally, add the class-map to the policy-map with the pass action:

policy-map type inspect Guest-Self-PMap
 class type inspect Guest-Self-CMap
  inspect
 class type inspect Guest-Self-DHCP-CMap
  pass log
 class class-default
  drop

After that, the clients started receiving IP addresses again.

ZBFW-ROUTER#show ip dhcp binding
Bindings from all pools not associated with VRF:
IP address          Client-ID/              Lease expiration        Type
                    Hardware address/
                    User name
192.168.200.201     014d.970e.4136.af       Oct 21 2015 10:43 AM    Automatic
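If you want to verify that the pass rule is actually being hit, the per-class counters on the zone-pair are the place to look. The zone-pair name below is assumed, so substitute your own:

ZBFW-ROUTER#show policy-map type inspect zone-pair Guest-Self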

Outage Report – SRX ALG failure – ‘application failure or action’ logs

The initial complaint came from one of our branch offices, which was having issues reaching internal applications and internet sites. After some troubleshooting, we found that we could no longer query the remote DNS server, and they could not reach their internal forwarders. At first I wrote it off as a DNS server issue, but after more queries against different VPN networks it was clear that all UDP DNS traffic was being lost in transit.

To verify whether the packets were going through, I started a packet capture on the LAN interface of the remote firewall and ran some DNS queries. Sure enough, no UDP/53 packets were arriving over the tunnel, while TCP/53 had no issues. So I enabled logging on this specific tunnel policy, and while checking the logs for the DNS traffic, I found these lines instead of the usual RT_FLOW_SESSION_CREATE entries.

2015-09-25 14:19:53	User.Info	10.164.3.70	1 2015-09-25T14:19:53.047 VPN_Box_A RT_FLOW - RT_FLOW_SESSION_CLOSE [junos@2636.1.1.1.2.36 reason="application failure or action" source-address="10.164.242.1" source-port="58358" destination-address="10.10.132.40" destination-port="53" service-name="junos-dns-udp" nat-source-address="10.164.242.1" nat-source-port="58358" nat-destination-address="10.10.132.40" nat-destination-port="53" src-nat-rule-name="None" dst-nat-rule-name="None" protocol-id="17" policy-name="FW-Policy100" source-zone-name="trust" destination-zone-name="untrust" session-id-32="51106" packets-from-client="0" bytes-from-client="0" packets-from-server="0" bytes-from-server="0" elapsed-time="1" application="UNKNOWN" nested-application="UNKNOWN" username="N/A" roles="N/A" packet-incoming-interface="reth0.0" encrypted="No "]

The tell-tale part is the close reason, which is not a standard event: RT_FLOW_SESSION_CLOSE [junos@2636.1.1.1.2.36 reason="application failure or action"

When going through the entire log file for other entries with the “application failure or action” message, I found many more related to RPC and FTP. This immediately pointed to the ALG engine.

2015-09-25 14:19:54	User.Info	10.164.3.70	1 2015-09-25T14:19:53.846 VPN_Box_A RT_FLOW - RT_FLOW_SESSION_CLOSE [junos@2636.1.1.1.2.36 reason="application failure or action" source-address="10.164.110.223" source-port="9057" destination-address="10.104.12.161" destination-port="21" service-name="junos-ftp" nat-source-address="10.9.1.150" nat-source-port="58020" nat-destination-address="10.xx.70.1" nat-destination-port="21" src-nat-rule-name="SNAT-Policy5" dst-nat-rule-name="NAT-Policy10" protocol-id="6" policy-name="FW-FTP" source-zone-name="trust" destination-zone-name="untrust" session-id-32="24311" packets-from-client="0" bytes-from-client="0" packets-from-server="0" bytes-from-server="0" elapsed-time="1" application="UNKNOWN" nested-application="UNKNOWN" username="N/A" roles="N/A" packet-incoming-interface="reth0.0" encrypted="No "]

2015-09-25 14:19:51	User.Info	10.164.3.70	1 2015-09-25T14:19:51.444 VPN_Box_A RT_FLOW - RT_FLOW_SESSION_CLOSE [junos@2636.1.1.1.2.36 reason="application failure or action" source-address="10.164.243.110" source-port="50801" destination-address="10.10.132.50" destination-port="135" service-name="junos-ms-rpc-tcp" nat-source-address="10.164.243.110" nat-source-port="50801" nat-destination-address="10.10.132.50" nat-destination-port="135" src-nat-rule-name="None" dst-nat-rule-name="None" protocol-id="6" policy-name="FW-Policy100" source-zone-name="trust" destination-zone-name="untrust" session-id-32="2317" packets-from-client="0" bytes-from-client="0" packets-from-server="0" bytes-from-server="0" elapsed-time="1" application="UNKNOWN" nested-application="UNKNOWN" username="N/A" roles="N/A" packet-incoming-interface="reth0.0" encrypted="No "]

Coincidentally, we had already received a ticket related to FTP traffic over a different VPN, but the tunnel was up and all other services were open, so we didn’t immediately relate the two cases.

As a quick workaround for DNS, I disabled the DNS ALG. If you want to know exactly what the ALG does, Bart Jansens has a good write-up here. We could live without those features for a while.
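The change itself is a single statement from configuration mode, followed by a commit (sketched here as run on the cluster primary):

{primary:node0}[edit]
netprobe@VPN_Box_A# set security alg dns disable
netprobe@VPN_Box_A# commit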

{primary:node0}
netprobe@VPN_Box_A> show configuration security alg dns 
disable;

After disabling the ALG, all our DNS queries went through again. Instead of being closed immediately once a response is received, DNS sessions now age out with a 60-second timeout. This left a few more sessions in the flow table, but nothing the SRX couldn’t manage.
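If you want to see those lingering sessions for yourself, the flow table can be filtered on protocol and destination port, along these lines:

netprobe@VPN_Box_A> show security flow session protocol udp destination-port 53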

The same workaround worked for FTP. Unfortunately, I couldn’t find a quick way to restart the ALG process, so I rebooted the cluster nodes over the weekend. After the reboot, all the services inspected by the ALGs worked as designed again.
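For completeness: the FTP workaround is the equivalent statement for that ALG, and the overall ALG state can be checked from operational mode (output format varies by release):

{primary:node0}[edit]
netprobe@VPN_Box_A# set security alg ftp disable

{primary:node0}
netprobe@VPN_Box_A> show security alg status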

*Hostnames, IP addresses and policy names have been changed