Created: April 3, 1999; Les Cottrell and Warren Matthews. Last updated on May 22, 1999
To estimate the magnitude of the error introduced, we pinged from one host (minos.slac.stanford.edu) to another host (hermes.slac.stanford.edu) on the same subnet and compared the RTTs reported by a sniffer on the wire with those reported by ping. Minos is an IBM RS/6000 250/80 running AIX 4.1.5, and Hermes is a Sun Sparc 5/70 running SunOS 4.1.3.1. Minos is the host from which we normally run the PingER monitoring code at SLAC. The resolution reported by both the sniffer and ping is 1 msec.
The results of the measurements for about 1500 pings are shown in the figures below. The first figure shows a histogram of the frequency of the wire RTTs. The line is a fit to a power series with the parameters shown and a correlation coefficient (R²) as shown. The second figure shows a similar distribution for the RTT as reported by the ping process running on Minos. In both cases the averages, standard deviations, medians and the time range into which 95% of the measurements fall are also shown.
It can be seen that there is a difference of about 1 msec (1.053-0.56) in the averages, the medians are the same, and the time range that includes 95% of the measurements is 1 msec in both cases. The two distributions are also seen to have similar standard deviations (0.56 and 0.4 msec).
We thus report the 'wire time' versus 'host time' systematic error as 1 msec and the calibration error as 1 msec for this pair of hosts.
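The comparison above can be sketched in a few lines of Python. This is only an illustration: the sample lists are made up (they are not the measured data), the function name `summarize` is hypothetical, and the 95% range is taken here as the spread between the 2.5th and 97.5th percentiles, which is one possible reading of "the time range into which 95% of the measurements fall".

```python
import statistics

def summarize(rtts_ms):
    """Return mean, standard deviation, median and central 95% range of RTT samples."""
    ordered = sorted(rtts_ms)
    n = len(ordered)
    # Simple nearest-rank percentiles; an assumption, not the authors' method.
    lo = ordered[int(0.025 * (n - 1))]
    hi = ordered[int(0.975 * (n - 1))]
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
        "median": statistics.median(ordered),
        "range95": (lo, hi),
    }

# Hypothetical samples at 1 msec resolution, NOT the measured data.
wire_ms = [0] * 9 + [1] * 10 + [2]   # RTTs as seen by the sniffer
ping_ms = [1] * 18 + [2] * 2         # RTTs as reported by ping

wire = summarize(wire_ms)
ping = summarize(ping_ms)

# Systematic 'wire time' versus 'host time' error: difference of the averages.
systematic_error_ms = ping["mean"] - wire["mean"]
print(wire, ping, systematic_error_ms)
```

With real data, `wire_ms` and `ping_ms` would be filled from the sniffer trace and the ping output for the same set of packets.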
We also made a test of 12,000 pings between the SLAC PingER monitoring host (minos.slac.stanford.edu) and a host at LBNL (hershey.es.net). In this case the difference between the RTT measured on the wire and that reported by ping was always < 2 msec, even though there were pings with RTTs of > 100 msec. Thus, at least in this sample, the long RTTs were not related to monitoring host delays.
The pathchar behavior between a host on the same subnet as minos (minos is an AIX host and pathchar does not run on it) and hershey is shown below:
>pathchar -q 64 hershey.es.net
pathchar to hershey.es.net (198.128.1.11)
 mtu limitted to 8192 bytes at local host
 doing 64 probes at each of 64 to 8192 by 260
 0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)
 |    77 Mb/s,   462 us (1.77 ms)
 1 RTR-CGB5.SLAC.Stanford.EDU (134.79.19.3)
 |   294 Mb/s,   218 us (2.43 ms)
 2 RTR-CGB6.SLAC.Stanford.EDU (134.79.135.6)
 |    18 Mb/s,   276 us (6.53 ms)
 3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
 |    ?? b/s,   -85 us (2.44 ms)
 4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)
   -> 192.68.191.18 (1)
 |    ?? b/s,   1.42 ms (5.13 ms)
 5?lbl1-atms.es.net (134.55.24.11)
 |   245 Mb/s,   71 us (5.54 ms)
 6 esnet-lbl.es.net (134.55.23.66)
 |   9.7 Mb/s,   95 us (12.5 ms)
 7 hershey.es.net (198.128.1.11)
7 hops, rtt 4.91 ms (12.5 ms), bottleneck 9.7 Mb/s, pipe 42418 bytes
One might assume that self-pinging would give an overestimate of the RTT overhead caused by the monitoring host. In the self-ping case the host has to create and send the echo request, receive the echo request, process it and echo the response, and then receive the response and process it, whereas in the non-self ping case the monitoring host does not have to handle receiving the request, processing it and sending the response. However, when a host pings itself, the packets do not usually go onto the physical network. As a result the client does not usually have to do an I/O wait to get the response back. Typically one would expect self-ping times to be of the order of 0-3 msec. For example, in the case of the SLAC PingER monitoring host (minos) the probability of the self-ping RTT being >= 1 msec is < 0.1%.
In the non-self ping case the client will probably have to do an I/O wait to read the response back. If the client system is busy, then there can be considerable delay before the client task is dispatched again. This may occasionally lead to much longer RTTs being recorded by ping than are seen on the wire. For example, 10,000 pings to hershey.lbl.gov from minos.slac.stanford.edu on 4/5/1999 gave an Inter Quartile Range (IQR) of 1 msec (2 msec), a median of 4 msec (5 msec), and a distribution that fell off exponentially between 4 msec and 25 msec with an R² of 0.93 (in other words, a good fit), but had 13 (15) RTTs > 30 msec with a fairly flat distribution from 40 to 360 msec (the numbers in parentheses in this sentence are for the reverse direction). For both these hosts (minos and hershey) the self-pings had a probability of < 0.1% of being > 1 msec. The large RTTs of these few outliers suggest that using the maxima, or any statistic that is sensitive to large outliers, is not a good idea.
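To illustrate why the maximum is a poor choice, the following minimal Python sketch compares the median and IQR with the maximum for a sample with and without a single long outlier. The sample values, the function name `robust_summary` and the 30 msec outlier threshold parameter are assumptions made for this illustration and are not the measured data; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

def robust_summary(rtts_ms, outlier_threshold_ms=30):
    """Summarize RTT samples with outlier-resistant statistics plus the maximum."""
    # Quartiles using the statistics module's default (exclusive) method.
    q1, median, q3 = statistics.quantiles(rtts_ms, n=4)
    return {
        "median": median,
        "iqr": q3 - q1,
        "max": max(rtts_ms),
        "outliers": sum(1 for r in rtts_ms if r > outlier_threshold_ms),
    }

# Hypothetical RTTs in msec, NOT the measured data.
clean = [3, 4, 4, 4, 4, 5, 5, 6]
with_outlier = clean + [360]   # one ping delayed by a busy client

a = robust_summary(clean)
b = robust_summary(with_outlier)
print(a)
print(b)
```

In this made-up sample the median is unchanged by the outlier and the IQR moves only slightly, while the maximum jumps from 6 to 360 msec.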
Thus an alternative is for the host to monitor a second host at the same site in order to get a more realistic idea of the impact of the host on the RTT (this assumes that a response time of a few msec. will be sufficient to cause the ping client to do an I/O wait). When we do this between minos and ns2 (a Sun Sparc 5/70 running SunOS 4.1.3.1), two hosts on the SLAC LAN separated by one router hop, we get a median RTT of 3 msec., an IQR of 1 msec. and a probability of < 0.01% for an RTT of > 10 msec.
Providing an estimate of the error introduced by the remote host requires using a sniffer to measure and compare the wire time of the incoming echo request with that of the outgoing echo response. This is expected to be small (< 1 msec; for example, on a 166 MHz Intel Pentium running Windows 98 it is about 350 usec) since the echo response is usually done in the kernel and should not require an I/O wait to turn around the packet. If one is measuring Internet traffic, the turn-around time of the remote host will usually be small compared to the complete ping RTT, so it may only be necessary to ensure that this is so, i.e. ensure the turn-around time is < 1-2 msec when one is measuring response times of 50 msec or greater (assuming a systematic error contribution of < 4% is acceptable).
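The 4% figure follows from dividing the turn-around time by the RTT. A minimal Python sketch of that arithmetic is given below; the function name `relative_error_percent` is hypothetical and introduced only for illustration.

```python
def relative_error_percent(turnaround_ms, rtt_ms):
    """Systematic error contribution of the turn-around time, as a percentage of the RTT."""
    return 100.0 * turnaround_ms / rtt_ms

# Worst case quoted in the text: 2 msec turn-around on a 50 msec RTT.
worst_case = relative_error_percent(2, 50)      # 4.0
# The 350 usec turn-around measured on the Pentium host, on the same 50 msec RTT.
pentium = relative_error_percent(0.35, 50)      # about 0.7
print(worst_case, pentium)
```

For longer RTTs the same turn-around time contributes proportionally less, so the 50 msec case is the limiting one.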