Measuring Network Performance – Latency or Jitter

There are some age-old conundrums that may never be solved: dog or cat, Facebook or Instagram, burger or taco, or most importantly of all, #whiteandgold or #blueandblack? When it comes to measuring network performance, we have a debate of our own: jitter or latency. But what exactly do these two things measure, and which gives a more accurate indication of network performance? Or, more importantly, what exactly should we measure to maintain a good end user experience?

Ok, so first of all, let’s start with some definitions, and we perhaps need to broaden the set of metrics a little to get a good understanding of the impact on network performance.

What are Latency and Delay?

Latency and delay are close relatives and are often used interchangeably. Strictly speaking, however, they are different things. Delay can be defined as the length of time it takes for a packet (well, more accurately, a bit) to move from one host to another. Factors that affect delay include processing time, queueing delay, transmission delay and propagation time.

By contrast, latency is commonly defined as round trip time (or RTT), that is, the time taken for a packet to be sent plus the time taken for a corresponding packet to return. In this definition, latency is in effect bidirectional delay. Strictly speaking, from a network viewpoint latency should not include processing time on the server, but for practical purposes most network based monitoring solutions include server processing time in their latency calculations.

Again, latency or RTT is a very common performance metric as it is relatively easy to calculate, especially for protocols like TCP that provide a mechanism to acknowledge packets. By contrast, one-way delay can be difficult to measure, as the receiver typically has no way of determining the precise time the packet was sent.

What are Packet Inter Arrival Time and Jitter?

More recently, there have been tools with more sophisticated performance monitoring metrics, specifically tools that can calculate Packet Inter Arrival Time (IAT) and Jitter. Packets transmitted across a network will have differing delays, even if they traverse the same path. As mentioned above, delay is affected by such things as queueing and transmission delays in routers or switches or propagation delays in the network itself.

Because each packet will have a different delay value, the ‘gap’ between them as they arrive at the endpoint will vary. This gap, or the time between arriving packets, is called the inter arrival time or IAT.

Jitter is essentially derived from IAT: it is a measure of the inconsistency of the packet inter arrival times. In simple terms, if the packet IAT is consistent the jitter can be considered low; conversely, if there is a wide variance of IAT values, the jitter is high. Many applications, especially real time communication applications like VoIP and video, are very sensitive to jitter.
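
As a rough illustration, here is a minimal Python sketch of the calculation (the arrival timestamps are invented for the example). It derives IATs from a series of packet arrival times and expresses jitter as the standard deviation of those IATs, which is one common formulation (RFC 3550, for instance, defines a smoothed variant for RTP):

    import statistics

    # Packet arrival timestamps in seconds (hypothetical values)
    arrivals = [0.000, 0.021, 0.039, 0.062, 0.080, 0.151, 0.170]

    # Inter arrival time: the gap between consecutive packets
    iats = [b - a for a, b in zip(arrivals, arrivals[1:])]

    # Consistent IATs -> low jitter; widely varying IATs -> high jitter
    jitter = statistics.stdev(iats)

    print("IATs:", [round(i, 3) for i in iats])
    print(f"Jitter (stdev of IAT): {jitter:.4f} s")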

What Affects Jitter?

If packets take the same path across the network, we would expect IAT to be reasonably consistent; after all, the packets are taking the same ‘route’ and should take pretty much the same time. It turns out this is not the case, and IAT, and consequently jitter, can be greatly impacted by two significant network conditions:

  • Packet Loss. If packets are lost across the network, the sending host needs to resend. Reliable transport protocols such as TCP have mechanisms to identify and address this condition, but the resulting side effect is larger IATs (and consequently higher jitter). Packet loss is a significant contributor to IAT and jitter.
  • Network Congestion. Similar to packet loss, if the network is congested then sending hosts may not be able to transmit frames immediately but rather need to buffer and retry. At best this contributes to longer IATs, and at worst to packet loss as buffers overflow and packets are dropped. Again, congestion is a major contributor to longer IATs as well as to degraded end user experience.

There are of course many other factors that influence IAT and jitter, but for the sake of simplicity we have focused on the two major contributors of packet loss and congestion.

Calculating Packet Loss and Congestion

So if packet loss and congestion have such an important impact on network performance, why don’t we just measure them and be done with it? Well, the simple answer is that they are actually very difficult to measure.

It is very simple to measure things like byte and packet counts – as a packet arrives we simply count the number of bytes and add them to the total. Measuring packet loss is somewhat more difficult: we are trying to count things that don’t exist, to measure packets that were lost.

The situation is actually worse than it seems. Not only are we trying to count a packet that no longer exists, but transport protocols such as TCP retransmit when packets are dropped, so whilst the first one may be lost, a replica arrives in its place, albeit a little later than expected.

From a monitoring perspective it can be very difficult to distinguish a retransmitted frame from an original packet. That is, it is very difficult to accurately measure packet loss, and accurate measurement of congestion poses similar difficulties.
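
As a rough illustration of the problem, the sketch below (a deliberately naive detector, with packets represented as simple tuples rather than a real capture format) flags TCP segments whose data does not advance the stream. Note that it cannot tell a genuine retransmission from an out-of-order arrival or a duplicate at the capture point, which is precisely the ambiguity that makes loss so hard to measure passively:

    # Naive per-flow retransmission detector (illustrative only).
    # Each packet is (seq, payload_len) for one direction of a TCP flow.
    def find_retransmissions(packets):
        highest_seq_end = None
        suspected = []
        for seq, payload_len in packets:
            seq_end = seq + payload_len
            if payload_len > 0 and highest_seq_end is not None and seq_end <= highest_seq_end:
                # Data we have already seen: a retransmission, an
                # out-of-order arrival, or a capture-point duplicate?
                suspected.append(seq)
            highest_seq_end = seq_end if highest_seq_end is None else max(highest_seq_end, seq_end)
        return suspected

    # Hypothetical flow where the segment at seq 3000 appears twice
    flow = [(1000, 1000), (2000, 1000), (3000, 1000), (3000, 1000), (4000, 1000)]
    print(find_retransmissions(flow))  # -> [3000]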

Given that IAT is directly impacted by packet loss and congestion, by monitoring IAT (and jitter) we can gain a valuable and accurate picture of network congestion and packet loss without having to monitor either directly. In fact, IAT and jitter provide a much richer source of data for profiling end user experience than traditional packet loss or latency techniques.

But Isn’t Latency the Gold Standard for Network Performance?

Intuitively, it makes sense to think that measuring latency (or RTT) would be the best way to monitor network performance and end user experience. After all, latency measures round trip time, which in essence mimics the user experience of hitting ‘enter’ and having data returned to the screen. Why worry about more esoteric metrics such as IAT and jitter at all? The intuitive belief that RTT is the best way to measure end user experience has also been perpetuated by network monitoring vendors whose only ‘delay’ metric is in fact RTT, and so it is presented as the metric of choice.

It turns out that, whilst useful, RTT falls short of IAT and jitter in two main areas:

  • Firstly, many applications are tolerant of long RTT but particularly intolerant of IAT variation, or jitter. In particular, RTT is not effective for monitoring real time streaming applications such as VoIP and video; these applications need more sophisticated metrics such as IAT and jitter.
  • Secondly, many vendors calculate RTT as the time taken to establish the initial TCP three way handshake (SYN-SYNACK-ACK). That is, the RTT is measured at the start of the flow and typically never recalculated. The nature of the TCP protocol, with its sliding window algorithm (among other things), makes calculating RTT mid-flow difficult. Basing RTT on the initial flow setup time is like basing your commute time on how long it takes to back out of the garage: it may give some indication, but it doesn’t take into account all the red lights on the way to the office (see the sketch below).
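
To make the handshake approach concrete, here is a minimal sketch (packet timestamps and flags are hypothetical, as if extracted from a capture taken client-side). It yields exactly one RTT sample at flow setup and nothing thereafter: the ‘backing out of the garage’ measurement.

    # Handshake RTT: time from the client's SYN to the server's SYN-ACK.
    # Each packet is (timestamp_seconds, tcp_flags) for one flow.
    def handshake_rtt(packets):
        syn_time = None
        for ts, flags in packets:
            if flags == "SYN":
                syn_time = ts
            elif flags == "SYN-ACK" and syn_time is not None:
                return ts - syn_time  # one sample, never recalculated
        return None

    flow = [(0.000, "SYN"), (0.042, "SYN-ACK"), (0.043, "ACK"), (0.100, "DATA")]
    print(f"Handshake RTT: {handshake_rtt(flow):.3f} s")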

For these reasons, we are seeing many monitoring vendors start to move away from pure RTT measurement. By incorporating delay metrics such as IAT and Jitter – which are calculated for every packet, and hence for the life of the flow – we can start to build a much richer picture of end user experience.

Finally, unlike RTT, more advanced metrics like IAT and jitter can infer complex network conditions such as packet loss and congestion, providing deeper insight into network performance and end user experience.

Is Packet Capture Still Relevant in Modern Networks?

A long time ago in a data centre far, far away, packet analysers (or packet capture tools) were the default ‘go-to’ tool for network diagnostics. Packet analyser vendors made good money before being usurped by the ubiquitous rise of freeware offerings like Wireshark. But have rising network speeds and changing topologies consigned packet capture as a diagnostic technique to the annals of history?

Let’s look at a real life case study ….

You have a remote branch, say a dozen users or so, complaining of poor performance. Your existing network monitoring system may be showing good usage statistics on application mix, latency and throughput, but still the performance problem persists. What would be really great is if you could just drop a packet analyser on site to get a look at exactly what is going on.

No problem right?

Oh hang on, first you have to ship a packet analyser to site, then you need to configure a SPAN port or install a network tap (do I need a change request for this?). Of course you will need a network engineer to run the capture, hoping the issue actually occurs while the capture is running, before returning to assess what the problem is…

It’s no wonder packet capture has fallen out of favour!

What if we had a packet capture engine permanently installed in-line, with a web based interface from which we could initiate packet captures from wherever we are, independent of where the remote site actually is? And better still, what if that device was priced at a level where installing it in every remote branch was affordable?

That’s exactly what Byte25 delivers with the Byte25 Branch Appliance. A device specifically designed for remote branch monitoring complete with a web based packet capture engine (as well as a complete deep packet inspection and intrusion detection engine, but more on this in another post). The Byte25 Branch Appliance can be installed permanently in-line or connected to a switch SPAN/mirror port for complete visibility of ALL your packets when you really need visibility.

Why Network Performance Monitoring is Critical for Cyber Security

A quick browse around the Internet shows that almost all traditional network monitoring vendors are now touting their expertise in cyber security. A large part of this is of course the relative size of the network performance monitoring and cyber security markets … there is simply more budget for cyber than there is for old-school network performance monitoring.

But perhaps there is more to this shift. Network performance monitoring data, whilst valuable in its own right with regard to maintaining a high level of user experience, also provides the perfect adjunct for cyber security incident identification, triage and resolution. Let’s take a quick look at how network performance monitoring data can complement cyber security in a real world situation …

It’s Monday morning and your intrusion detection system (IDS) triggers a critical event informing you that a network trojan has been identified. The IDS gives good information about the origin of the attack, affected devices and operating systems, and impact, and even points to potential resolutions. All good so far: you have caught the attack and are able to resolve the immediate threat to the organisation. In short, the IDS has done exactly what you needed it to do.

But this alone is not enough to fully assess the potential ongoing threat. The IDS has identified the source of the attack, in this case an IP address emanating from Eastern Europe – wouldn’t it be great to also be able to identify which other devices within your network the malicious IP address has communicated with?

Enter network performance monitoring data!

Network performance monitoring data can quickly identify which hosts within your network have been touched by the malicious IP address, how often and at what times. In our example, the IDS triggers an alarm for just one internal host, but the network performance monitoring data has identified connections from the malicious IP to 5 other internal hosts over the last 7 days – albeit on different protocols.
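The kind of query involved is conceptually simple. Here is a hedged sketch in Python (the flow record layout and the addresses are invented for the example) that filters flow records for anything that has talked to the malicious address and summarises the internal hosts touched, how often and when:

    from collections import defaultdict

    # Hypothetical flow records: (timestamp, src_ip, dst_ip, protocol/port)
    flows = [
        ("2021-03-01 09:12", "203.0.113.99", "10.0.0.21", "tcp/443"),
        ("2021-03-02 14:05", "10.0.0.34", "203.0.113.99", "tcp/22"),
        ("2021-03-04 07:58", "203.0.113.99", "10.0.0.21", "udp/53"),
    ]

    malicious_ip = "203.0.113.99"
    touched = defaultdict(list)

    # A flow counts whether the malicious IP was source or destination
    for ts, src, dst, proto in flows:
        if src == malicious_ip:
            touched[dst].append((ts, proto))
        elif dst == malicious_ip:
            touched[src].append((ts, proto))

    for host, events in sorted(touched.items()):
        print(f"{host}: {len(events)} connection(s): {events}")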

But further still, apart from identifying all affected hosts, the network performance data can quickly examine each of these internal hosts to see who else they are communicating with (both internally and externally), easily identifying potential lateral movement of the trojan within the internal network.

Correlating cyber security information such as IDS events with network performance monitoring data provides the ideal tool not only for diagnosing immediate threats but also for determining seemingly unrelated downstream side effects that may have a significant impact on the security of your organisation.

Byte25 provides a fully integrated set of network performance monitoring and threat detection appliances in a single platform for this very purpose. Correlating network performance and cyber security data sets has never been simpler.

HTTP/3 is Coming – Look Busy

There is a new major version of HTTP on the horizon, HTTP Version 3. There is nothing you need to do, as HTTP/3 will sit quite happily alongside older versions 1 and 2 promising better performance. The major technical difference is that HTTP/3 does not use TCP as the network transport layer protocol but instead uses a newer UDP based protocol called QUIC.

QUIC (pronounced “quick”) was developed by Google in 2012 (according to the IETF, QUIC isn’t officially an acronym, although some suggest it stands for “Quick UDP Internet Connections”). The move to QUIC rather than TCP aims to fix a problem of HTTP/2 called ‘head of line blocking’.

Hmmm … OK, I was with you until the term ‘head of line blocking’… Just what is this problem and why do I care? Well, let’s try and put this in simple terms. The HTTP/2 protocol can download multiple concurrent streams in parallel to improve performance (often referred to as multiplexing). The problem here is that the underlying TCP transport protocol that HTTP/2 uses has no visibility of the higher level HTTP/2 multiplexing mechanism. When packets are lost, the TCP recovery techniques cause all the higher level HTTP/2 streams to stall, regardless of which stream the lost packet actually belonged to.

Put simply, under HTTP/2, a single lost packet can disrupt multiple HTTP streams. A shift to HTTP/3 based on QUIC should see significant performance improvement particularly in the event of packet loss.

The QUIC transport protocol provides native multiplexing, where lost packets only affect the streams whose data has been lost. All QUIC streams share the same single QUIC connection, but each stream is independent, so that in most cases packet loss affecting one stream doesn’t affect any others. This is possible because QUIC packets are encapsulated on top of UDP datagrams.

On the upside, the shift to HTTP/3 will be largely transparent to end users. The major browsers already support HTTP/3: Chrome, Firefox, Safari, even Edge. Server support is slower but likely to follow, with Nginx and Cloudflare already stepping up to the plate. So the good news is that HTTP/3 is coming and it should provide a better Internet experience. Even better, and unlike other tech innovations (did anyone mention IPv6?), there is little you need to do to migrate – the existing infrastructure will automatically upgrade to support HTTP/3.
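
If you are curious whether a given site is already advertising HTTP/3, one way is to look for the Alt-Svc response header, which servers use to announce QUIC/HTTP/3 support (e.g. 'h3=":443"'). A quick Python sketch follows; whether any particular site returns the header will of course vary over time:

    import urllib.request

    # Servers advertise HTTP/3 support via the Alt-Svc header on
    # responses served over HTTP/1.1 or HTTP/2.
    def advertises_http3(url):
        with urllib.request.urlopen(url) as resp:
            alt_svc = resp.headers.get("Alt-Svc", "")
        return "h3" in alt_svc, alt_svc

    supported, header = advertises_http3("https://www.cloudflare.com")
    print(f"HTTP/3 advertised: {supported} (Alt-Svc: {header!r})")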

Just sit back and look busy.

Are We Witnessing the Death of Network Performance Monitoring?

This might seem a strange subject for a blog post from a company like Byte25, whose main focus is in fact …. network performance monitoring, but it is an interesting question for consideration. I wouldn’t necessarily be calling the undertaker just yet, but there is no question that the role of network performance monitoring within IT operations is changing.

For those that don’t have time to read the whole blog … network performance monitoring is not dying, but it definitely needs to adapt to meet the needs of modern network environments.

Way back in the early 2000’s when network bandwidth was small and expensive, network performance monitoring was an essential part of maintaining end user experience. SNMP (the Simple Network Management Protocol) was flavour of the month for collecting network performance metrics with a range of commercial and freeware tools (remember MRTG or HP OpenView?) dedicated to collecting information from switches and routers across the network.

Ahhhhh, the good old days! We could see how much traffic was running through the network pipes, gauge capacity and ensure the links were correctly provisioned – and life as a network engineer was good! But just as we thought we had it made, business started throwing more applications at the network, the links became more heavily utilised and questions were asked as to why end user performance was degrading.

SNMP, whilst good, simply wasn’t sufficient to answer these questions. We could see the volume of data but not who or what was actually generating the traffic. If only we had tools that would identify the breakdown of actual traffic!

Fortunately flow based protocols such as NetFlow and IPFIX evolved to answer this very question. Using NetFlow, we could identify both the type of traffic (at least up to Layer 4) as well as the source and destination. Now we could carefully monitor network links and see exactly what applications and what users were contributing to bandwidth and act accordingly to ensure performant links.

But the world did not stand still. Applications started to use common transport layer protocols, like HTTP, and new dynamic port applications like VoIP emerged. This rendered application identification difficult using the available flow based techniques. To address this new shortfall, vendors turned toward sophisticated deep packet inspection (DPI) engines that examined data payloads to identify the actual application within the data stream.

DPI was network performance monitoring nirvana, presenting a detailed picture of which applications and which users were generating traffic. Many DPI implementations also had the added advantage of providing other performance metrics such as latency, jitter and round trip time further enhancing their usefulness. DPI provides a comprehensive view of network performance which persists today in many of the major network performance monitoring products.

So what’s the problem? Why the question on the relevance of network performance monitoring as a whole? Put simply, network performance monitoring has lost significance due to the availability of higher speed and lower cost network bandwidth. Organisations can ‘throw’ bandwidth at networks relatively cheaply making justification for expensive network monitoring tools difficult. Performance is no longer the main problem.

Network performance monitoring isn’t dying, but it does need to adapt to meet the needs of business and modern IT operations.

First up, and most importantly, as a network visibility vendor, we need to deliver monitoring products that address the needs of both business and IT operations. Performance metrics like throughput and errors, latency information and application level statistics gained from DPI are still very relevant in modern networks (especially so as we move toward more distributed environments like SD-WAN and cloud based applications), but they are only part of the solution.

Rather than focus on network link monitoring, performance monitoring tools need to leverage and correlate other data sources to provide a holistic picture of network activity. Rather than be labelled as ‘network performance monitoring’ tools, we should be thinking of ‘network visibility’ tools that derive and correlate information from a range of disparate sources.

Network visibility tools should be integrating with other data sources in a similar fashion to what SIEM tools do in the cyber security space. In addition to techniques like DPI, network visibility tools need to extract data from the range of other systems that rely on or contribute to network activity. This includes a wide range of systems such as Active Directory, Office365, cloud platforms like AWS, Intrusion Detection Systems, server logs and even endpoint agents. In this way, we can build a precise and useful picture of network activity. Or, put more simply, a complete picture of network visibility.

This is precisely what we are doing at Byte25. To maintain relevance, we are starting to leverage other datasets to present a picture of network visibility. We already have an integrated IDS engine sitting alongside our DPI engine and are working toward other integrations to be delivered later this year.

For sure there will always be a place for traditional network performance monitoring, but moving toward comprehensive network visibility provides a stronger product set that actually meets the needs of IT operations and the business as a whole.

A Battle for the Ages, Packet Based v Flow Based Analysis Tools

Network performance monitoring tools use a wide range of techniques to collect data for analysis. Two of the most common are packet based and flow based data collection. In this blog post, we will take a look at the pros and cons of each, and propose a solution to maximise the value of both.

First, a quick discussion of each…

Flow based collection relies on ‘flow agents’ embedded in network devices such as switches or routers, which collect and export network performance data to a central server for analysis and presentation. Flow agents usually collect meta-data per flow, that is, metrics pertaining to the communication between hosts in a network. Flow data presents statistical information from network conversations, including such things as source and destination address, byte counts, packet counts and other information available in the TCP/UDP header.

By contrast, packet based collection relies on the ability to examine every packet traversing a link in order to extract performance metrics, usually via a dedicated appliance installed either inline or connected to a SPAN port or network tap. Packet based collection provides a far more granular and precise mechanism for performance monitoring than flow based collection, supporting more sophisticated data such as inter packet arrival time, latency and jitter. Additionally, packet based collection can support deep packet inspection, looking deep inside the packet payload to identify application specific metrics. It does, however, come at a cost: packet based collection can be difficult (and expensive) to do at high speed, and the resulting data sets can be large and difficult to manage.

So let’s take a quick high level look at the pros and cons of each …

Flow based analysis is usually a lower cost option, after all, your switches and routers probably already support flow agents such as Cisco NetFlow, sFlow or IPFIX. Leveraging this data is often as simple as installing a flow capable collector to correlate and report on usage.

If all you are looking at is capacity planning type data or a high level overview of throughput and usage across the network, then flow based monitoring is definitely the way to go. In this case, flow based monitoring provides a quick and easy way to achieve high level visibility of network traffic.

The other upside of flow based monitoring data is that it is typically lightweight. Because it relies on flow meta-data, the actual amount of data produced is relatively small allowing for easy enterprise wide deployment.

However, for day to day diagnosis of complex network issues or for a better understanding of end user experience, flow based collection can have some limitations. Packet based collection provides a much richer source of data for detailed diagnostics and performance analysis.

Because packet based collection relies on every single packet being examined, detailed latency and packet inter arrival times can be collected to provide deep insight into the actual behaviour of packets across the network and, accordingly, the ability to assess potential impact on end user experience. Additionally, deep packet inspection allows for application specific information unavailable with flow based monitoring, especially for proprietary applications or those that may use dynamic ports, such as VoIP.

So on the face of it, packet based collection is the way to go right? Well, not necessarily ….

Packet based collection relies on dedicated hardware probes that can be expensive, especially on high speed links. It also generates huge amounts of data that can be difficult to manage and analyse. So whilst packet based collection provides a richer source of data, the potential cost to implement can make the solution unviable.

So which is better and which way should you go to get the best network performance monitoring solution?

Well, at the risk of being a total fence sitter, the simple answer is … it depends. Both have their pros and cons and both are applicable in different environments. I think a better question is: how do we leverage the best of both techniques?

This is exactly our approach at Byte25. We have taken the best pieces of flow and packet based collection to develop a technique we call ‘hybrid packet based flow analysis’. That is, we examine each packet in the same fashion as traditional packet based collection to gather deep insight into performance including latency and deep packet inspection, but then create meta-data pertaining to each flow for storing in the analysis database. We are creating enriched flow data if you like, data with the traditional metrics of flow agents like NetFlow but enhanced with the deep insight data normally only available via packet based collection.
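
A toy version of the idea looks something like the sketch below (the packet tuples and field names are our own invention for illustration, not the actual Byte25 record format): packets are examined individually, but what gets stored is one enriched record per flow, with NetFlow-style counters plus timing detail such as jitter derived from per-packet timestamps:

    import statistics
    from collections import defaultdict

    # Hypothetical packets: (timestamp, src, dst, sport, dport, length)
    packets = [
        (0.00, "10.0.0.5", "198.51.100.8", 40001, 443, 1500),
        (0.02, "10.0.0.5", "198.51.100.8", 40001, 443, 1500),
        (0.09, "10.0.0.5", "198.51.100.8", 40001, 443, 600),
    ]

    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "arrivals": []})

    for ts, src, dst, sport, dport, length in packets:
        rec = flows[(src, dst, sport, dport)]
        rec["packets"] += 1          # classic flow-agent style counters
        rec["bytes"] += length
        rec["arrivals"].append(ts)   # packet-level detail kept for enrichment

    for key, rec in flows.items():
        iats = [b - a for a, b in zip(rec["arrivals"], rec["arrivals"][1:])]
        rec["jitter"] = statistics.stdev(iats) if len(iats) > 1 else 0.0
        del rec["arrivals"]          # store only the enriched flow record
        print(key, rec)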

This technique also allows us to keep relatively lightweight data sets, more akin to flow collection, without losing information granularity – that is, we achieve the best of both worlds. In addition, because we store data in a flow based format we have the added advantage of being able to also incorporate feeds from flow based agents. This allows enormous flexibility and cost savings – we can deploy more expensive dedicated packet based probes on important links like major Internet egress connections, but utilise low cost flow based agents for visibility of potentially less critical network connections. The structure of the hybrid packet based flow analysis database allows for both sets of data to exist heterogeneously for easy analysis and reporting.

So the question isn’t which technique is better, but rather how can I have the best of both worlds. This is the solution that Byte25 delivers to meet the needs of network performance monitoring in modern network topologies.

Why is my home network so slow?

Given that many of us have been working from home now for some months since the arrival of COVID, I thought it might be timely to take a look at the performance of asymmetric network links which are typical of many domestic Internet connections (at least this is certainly the case in Australia, and from what I can make out also common in many other parts of the world).

So what exactly is an ‘asymmetric’ network connection? Well, put simply, it is where the upload capacity is different from the download capacity. For example, download speeds may be 50Mbps whereas upload is often considerably less at, say, 10Mbps. This was very common with older style ADSL connections, the ‘A’ even standing for ‘asymmetric’.

In most domestic environments asymmetric connections work fine. Most households download far more than they upload, typically watching streaming video such as Netflix, which is pretty much a one way download stream. In fact, talking to a local carrier in pre-COVID times, I was told that as much as 70% of their evening traffic was made up of Netflix streams. So in this environment it makes perfect sense to tune the link to favour download rather than upload.

All good so far, until we all started working from home (WFH) on our asymmetric network connections. Rather than downloading streaming video, the WFH application of choice is more business related … enter video conferencing via applications such as Zoom.

Zoom quote on their website that for group video calling at 720p HD you need 1.5Mbps/1.5Mbps up/down (you have probably seen the Zoom ‘Gallery’ view). That is, we need as much upload as we do download. Makes sense right? I mean we are transmitting video as well as receiving.

But we should be cool: our upload speed is 10Mbps, well above the specified Zoom requirement, and yes, in most situations we should be just fine. But even so, I do occasionally have issues, even when the stated bandwidth is clearly enough and verified by running various speed tests that show upload capacity well above 1.5Mbps.

So why do I still experience occasional issues?

It turns out that asymmetric links exhibit quite unusual behaviour when it comes to latency and jitter. Sure, the throughput might be high, but the inter packet arrival times can vary greatly, especially on the upload side when compared to the download traffic. And real time streaming traffic such as video and voice is very sensitive to variations in inter packet arrival time; that is, video and voice are very sensitive to jitter.

Let’s drill down on this in more detail. At Byte25 we believe one of the best measures for gauging the performance of applications is what we call the Jitter Index, or Inter Packet Arrival Time Variance Coefficient. This is a measure of relative variability, calculated as the ratio of the standard deviation to the mean of the inter packet arrival times. Put simply, the higher the coefficient of variation, the greater the variation of inter packet arrival time and hence the greater the impact on the performance of apps like video and voice.
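
In code the calculation is tiny. A sketch with invented IAT samples (note the single outlier dragging the index up):

    import statistics

    # Hypothetical inter packet arrival times in milliseconds
    iats = [20.1, 19.8, 20.3, 45.0, 19.9, 20.2]

    # Coefficient of variation: standard deviation relative to the mean.
    # Higher values mean less consistent packet timing, hence more jitter.
    jitter_index = statistics.stdev(iats) / statistics.mean(iats)
    print(f"Jitter Index: {jitter_index:.2f}")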

Don’t worry if this seems confusing; let’s look at some pictures of real life traffic that will hopefully make it clear. Below is a Jitter Index graph from a network with a symmetrical Internet connection.

There are two things of note here: firstly, the upload and download (from/to) Jitter Index values are consistently alike throughout the monitored period, and secondly, the ‘variance’ of the values is also reasonably consistent; that is, they all come in at around an index value of 1.0 (note that what we are looking for here is not an absolute value, but rather the ‘shape’ of the values over time). Because there is little variation between the Jitter Index values, real time streaming applications like voice and video perform well.

When we contrast this with the same dashboard for an asymmetric link we see quite a different picture.

The shape of the Jitter Index is quite different. The Jitter Index From values are consistently higher than the Jitter Index To values; that is, the upload Jitter Index is higher than the download Jitter Index. Worse still, the Jitter Index From is not consistent but ranges from around 1.0 through to 4.0; that is, there is significant variance. This is not great for real time streaming protocols, and it is likely that we could experience performance issues for apps such as Zoom.

In an ordinary troubleshooting scenario using a tool like Byte25, we would drill down and examine the Jitter Index per application, in this case Zoom, to quickly diagnose potential issues relating to the connection. But even with this high level example we can see significant differences between symmetric and asymmetric Internet connections.

Traditionally, bandwidth and throughput have been the default go-to metrics for network performance, but to really analyse and diagnose application performance it is often necessary to dig a little deeper into latency and jitter measurements. Especially as we increasingly work from home and rely on video streaming apps such as Zoom, it is necessary to measure beyond the traditional metrics to ensure performant connections.

And just in case you are interested, we live and breathe this stuff at Byte25 so feel free to reach out if you want to discuss or need more information.

Distributed Packet Capture Solutions

… or the 3 major problems with packet capture in modern networks

Packet capture and analysis is a troubleshooting technique that has been around for many years now, and for sheer power and completeness there is arguably no better diagnostic technique for identifying network and security issues.

Networks have become larger, and modern topologies have made packet capture more difficult. New network technologies such as SD-WAN mean that network traffic no longer necessarily traverses a single egress point; remote sites often have their own Internet connection. Whilst a great improvement for user performance, these new network topologies make troubleshooting more difficult by restricting visibility of what is happening at each site.

Packet capture is particularly difficult in these large distributed environments, and whilst the need for diagnostic techniques like packet capture remains, the barriers to implementation are high. Let’s consider the 3 primary obstacles to packet capture in distributed network environments:

By its very nature, packet capture is complex. The ability to interpret packet traces and pinpoint security or performance issues requires a highly skilled (and expensive) network engineer. It is probably not a great use of a network engineer’s time to be sent to a remote site specifically to collect a packet capture.

Connecting a packet analyser to the network is difficult. Two common techniques involve configuration of a SPAN/mirror port on the switch or installation of a network tap. Configuring a SPAN port needs admin access to the switch, which is typically not easy, whilst installing a network tap requires a network outage to place the tap inline. Both these options are problematic in a production network.

Network performance and security issues are ephemeral by nature. By the time you dispatch an engineer to site to initiate a capture, chances are the issue is gone. What we need is the ability to turn on captures at will, without the delay of physically attending a remote site.

So here we have somewhat of a paradox: the benefits of packet capture are high, but are they high enough to outweigh the expense and effort of implementation? Put simply, does the packet capture solution cost more than the problem we are trying to fix?

At the core of the issue is that packet capture is typically a ‘standalone’ diagnostic tool. That is, we ship an engineer to site with a standalone laptop or analyser to plug in and grab the capture. Engineers and laptops are expensive, not the sort of thing you want to have on call at every site in the event that you just might have an issue requiring a packet capture.

We need to change our mindset from standalone to distributed – that is, a distributed packet capture solution. The distributed solution needs to be cost effective enough to be installed in every remote site, allowing immediate remote access to packet capture from a central location. It must also be centrally controlled so that engineers can initiate a capture from a simple web interface, from anywhere.

This is exactly what the Byte25 solution delivers: a range of appliances suitable and priced for installation in very small remote sites right through to large multi-GigE units for central data centre installation. The Byte25 solution allows network engineers to initiate remote packet capture anywhere from anywhere, removing geographical constraints. The captures can be viewed immediately via the Byte25 web user interface or downloaded to a third party analyser like Wireshark for deeper analysis.

The Byte25 distributed packet capture solution removes the constraints traditionally associated with packet capture at remote sites and enables one of the most powerful diagnostic techniques in the network engineer’s arsenal.

A Review of the Garland Technology FieldTap

There are some unique advantages in using a Garland FieldTap with a Byte25 appliance that were not immediately obvious until we connected it up and ran some tests. Turns out this is a perfect tap to use for monitoring.

For information on the FieldTap from Garland Technology, look here:

https://www.garlandtechnology.com/products/fieldtap

A quick overview of the Garland FieldTap

I guess if you are actually reading this blog post, you probably have a fair idea of what a network tap actually does. But just in case, and in as few words as possible: a network tap is a small hardware device that sits in-line on an Ethernet connection and allows a copy of all packets to be sent to a monitoring device. The beauty of a network tap is that it provides a fully fault tolerant solution – it does not impact the performance of the Ethernet connection and will continue to pass packets even in the event of a power outage.

The Garland FieldTap is an interesting and useful network tap variant. Normal Ethernet network taps output a ‘copy’ of the packets on standard Ethernet connections; the Garland FieldTap, by contrast, listens to standard copper Ethernet but outputs a copy of all traffic to a USB 3.0 interface. This allows great flexibility as to what type of monitoring devices can be deployed. I guess the most common deployment scenario for the Garland FieldTap is use by field engineers with laptops for network troubleshooting; however, the flexibility of the USB output solves a specific use-case for Byte25 when deploying monitoring appliances.

The following review was undertaken by Byte25 to assess the suitability of using the Garland FieldTap for connection to Byte25 monitoring appliances in production environments.

Dealing with Garland Technology

Not sure if the experience of dealing with vendors should form part of any review, but I am going to include it here as the Garland guys are super easy to deal with. Big shout out to Kumar Rajaram the Garland APAC Regional Director who drove all the way across greater Sydney to hand deliver a couple of units for us to play with. I certainly haven’t experienced this level of support from the bigger tap and NPB vendors.

Integrating with Byte25

Ok, so now to the technical stuff. We plugged the Garland FieldTap in and connected to a USB port on a Byte25 appliance. We use a Debian Linux distribution on our appliances and once connected, the Garland FieldTap was immediately visible as 2 new Debian network interfaces, in this case eth1 and eth2.  No configuration or software required, the USB chipset and ethernet ports are supported under the standard Debian ‘Buster’ distro. Just to double check that we had the right interfaces, we ran ethtool  on these interfaces which clearly showed the driver to be for a LAN79xx USB to ethernet chipset as expected in a USB capable network tap.

The two interfaces presented from the FieldTap under Linux (and I guess the same would be true for other operating systems) correspond to each of the Ethernet ports on the FieldTap itself. But even better than that, it looks like each interface defined under Linux for the FieldTap relates respectively to the transmit and receive sides of the Ethernet link being monitored. This is clever design by an engineer who understands network monitoring – only forwarding the transmit side of each Ethernet port means packets are counted only once by the connected appliance. This makes monitoring super easy, as we can just listen on the 2 FieldTap USB defined Ethernet interfaces under Linux, eth1 and eth2, and we get an exact copy of the monitored Ethernet link.
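
For a feel of what ‘just listening’ means in practice, here is a rough sketch using scapy (illustrative only, and it needs root privileges to capture): one sniffer per tap interface, which together reassemble the full duplex conversation:

    import time
    from scapy.all import AsyncSniffer

    def show(pkt):
        # sniffed_on records which tap interface the packet arrived on
        print(pkt.sniffed_on, pkt.summary())

    # eth1 carries one direction of the monitored link, eth2 the other
    sniffers = [AsyncSniffer(iface=i, prn=show, store=False)
                for i in ("eth1", "eth2")]
    for s in sniffers:
        s.start()
    time.sleep(10)   # capture for ten seconds
    for s in sniffers:
        s.stop()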

This made integration with Byte25 seamless: the FieldTap simply presents as 2 extra interfaces which we then listen on – so no code changes on our end (I like this bit the best :).

What Happens When the Power Goes Out?

Now this was one of the nicest things about the FieldTap, and definitely something I wasn’t expecting. When a ‘normal’ Ethernet tap loses power, electronic relays are activated to close the circuit for the monitored Ethernet link to ensure the link remains active. This process results in a very short outage which normally triggers the Ethernet connections on each side to renegotiate, resulting in an actual outage of a few seconds. By contrast, the FieldTap has a cool design feature that allows it to draw power from the USB interface. This means that even if the tap loses external power, the Ethernet interfaces remain powered up, so there is no Ethernet renegotiation and no actual outage. For those familiar with Ixia taps, this is very similar in functionality to their zero delay tap, but way less expensive and probably more robust as it doesn’t rely on a battery that can run flat.

The Advantages of the FieldTap over Bridges and Mirror Ports

Most Byte25 customers use SPAN or mirror ports configured on a switch to access traffic, and most of the time this works fine. There are plenty of resources around pointing out the shortfalls of using switch SPAN/mirror ports, but in our experience, for a visibility solution like Byte25, SPAN/mirror ports work fine.

I say SPAN/mirror ports work ‘most of the time’; in reality, though, there are often 2 issues. Firstly, it is often difficult to actually configure a SPAN/mirror port: the connected switch may not support port mirroring, or the customer may not have access to the switch config. Secondly, the low end Byte25 Branch appliance has only one physical Ethernet port, making connecting a switch mirror port difficult.

To get around this, at Byte25 we have implemented an inline solution using an Ethernet chipset with bypass functionality. That is, the appliance can sit in-line, and in the event of a power outage the chipset will trigger a relay to ‘fail closed’ and keep traffic flowing. In essence this is the same functionality as a network tap. However, when the appliance is running, we need to implement bridging across the Ethernet ports in software in order for them to pass traffic. The bridging is dependent on the Linux kernel, so if Linux has a problem, such as exceptionally high CPU usage, we have the potential to impact network throughput. This is unusual and hasn’t happened to date, but it is certainly a potential point of weakness.

The Garland FieldTap presents a more robust solution in both these scenarios: firstly, it doesn’t take up an extra Ethernet port (which is perfect for the Byte25 Branch appliance), and secondly, it cannot impact network performance in the event of a software issue within the Byte25 appliance.

For implementations that require a high level of fault tolerance and robustness, Byte25 would certainly recommend the Garland FieldTap. Even when compared to a standard network tap there are significant advantages with the USB format which make the Garland FieldTap an excellent solution.

Correlation, the Key to True Network Visibility

Have you ever heard the term ‘single pane of glass’ when IT people speak about monitoring tools … of course you have! It’s an oft used (and misused) term for a management console that presents data from multiple sources in a single display. I probably first heard this term bandied about as far back as the 90s, and I am still yet to see an effective solution that offers a single pane of glass dashboard.

Harsh? Maybe … but the reality is that even in single pane of glass solutions, data is typically still ‘siloed’. That is, each disparate data set exists in its own data store. The single pane of glass solutions present this disparate data on a single dashboard – but in essence we are seeing graphs generated from different data sets simply displayed on the same page.

What we really need is the ability to look at the relationships between these different data sets. The cyber security world has been at the forefront of data set correlation, as seen in some of the more advanced Security Information and Event Management (SIEM) tools. Most SIEM tools can take information from multiple sources, say in the form of logs, and search for patterns or ‘correlations’ between them, providing deep insight into existing and potential security issues.

The network performance monitoring space has been slower to adopt this approach – I suppose because we have been grappling with large data sets that seem discrete from other data sources. But … (and this is important) the key to providing ‘network visibility’ as opposed to simple ‘network monitoring’ is the ability to correlate data from multiple sources in order to present a whole of network picture to assist in problem diagnosis and resolution, as well as security incident reporting.

Effective correlation relies on the ability to ‘link’ data sets to glean deeper insight. From a networking perspective, the most obvious common link is IP address. Most network related data sets, be they performance monitoring or cyber security, are IP address based, which allows an easy and quick point of correlation.

Certainly IP address correlation is a good start, but there is a more interesting, and arguably more effective, way. Corelight have developed a technique for pivoting between datasets for more effective visibility called Community Flow ID Hashing (https://github.com/corelight/community-id-spec). Community Flow ID Hashing takes flow tuple information and creates a hash value, or ID, that identifies a flow consistently across data sets.
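
For the curious, the v1 scheme is small enough to sketch in a few lines of Python. This is our minimal reading of the published spec for IPv4 TCP/UDP only; the real implementations also handle IPv6, ICMP and other corner cases:

    import base64
    import hashlib
    import socket
    import struct

    def community_id(saddr, daddr, sport, dport, proto=6, seed=0):
        # Endpoints as address+port byte strings, ordered so that both
        # directions of the same flow produce an identical hash input.
        src = socket.inet_aton(saddr) + struct.pack("!H", sport)
        dst = socket.inet_aton(daddr) + struct.pack("!H", dport)
        if src > dst:
            src, dst = dst, src
        data = (struct.pack("!H", seed) + src[:4] + dst[:4]
                + struct.pack("BB", proto, 0) + src[4:] + dst[4:])
        return "1:" + base64.b64encode(hashlib.sha1(data).digest()).decode()

    # The same flow seen in either direction yields the same ID, so an
    # IDS event and a DPI flow record can be joined on it directly.
    print(community_id("10.0.0.5", "198.51.100.8", 40001, 443))   # client -> server
    print(community_id("198.51.100.8", "10.0.0.5", 443, 40001))   # server -> client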

Corelight uses this to correlate between datasets from Suricata and Zeek to provide detailed cyber incident analysis. At Byte25 we have taken a similar approach by implementing the Community Flow ID within the Byte25 deep packet inspection engine. In this way, Byte25 can correlate data between the Suricata Threat Detection Engine and the network performance data collected from the DPI engine.

This provides deep network visibility well above the simple ‘single pane of glass’ approach. For example, if an incident is identified by the Byte25 Threat Detection engine, it is a simple process to use the Community Flow ID to pivot to the DPI data, set a filter and identify exactly when and how often the malicious device has communicated within the network and who else may have been affected. That is, a simple correlation that allows powerful diagnostics and forensics to identify and potentially remediate identified issues.

Single pane of glass may be a great marketing term, but true network visibility relies on the power of correlation. Normalising disparate datasets through techniques such as Community Flow ID Hashing is a great step forward in allowing pivots between different data sources.

A great step forward for network visibility.