1. Introduction
TCP was designed with reliable, connection-oriented data streams as a goal, so that it could overcome some of the limitations of the best-effort IP protocol. To that end, it provides flow and congestion control, urgent-data flagging, error detection, and recovery. It also breaks a large data stream into segments when needed; in that case, the segments are reassembled in order at the destination.
So we know that TCP should be able to detect errors to some degree. How does it do that? How effective is it? Can the detection fail? And if it does, how can we work around it?
In this tutorial, we’ll explore those questions and discuss some ways to improve the robustness of systems that rely on TCP with better error detection, or even error correction.
2. Checksum in TCP Packets
The table below shows the TCP packet header. As we can see, there is a field called Checksum, which stores a 16-bit checksum. It is computed as the one’s complement of the one’s complement sum of the 16-bit words of a pseudo-header (taken from the IP header), the TCP header itself (with the checksum field assumed to be zero), and the packet’s payload.
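To make the arithmetic concrete, here is a minimal sketch of the Internet checksum in Python. It only illustrates the one’s complement folding over an arbitrary byte string; a real TCP implementation would also cover the pseudo-header fields described above.

```python
def internet_checksum(data: bytes) -> int:
    """One's complement checksum over 16-bit words, as used by TCP (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # add the next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF                         # invert: one's complement of the sum


print(hex(internet_checksum(b"example header and payload bytes")))
```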
The one’s complement checksum was not the most robust error detection algorithm even at that time. However, it is very fast to evaluate, with almost no memory overhead. Speed and memory were pressing concerns in the early days of the Internet protocols.
When the receiving endpoint detects a checksum mismatch, it discards the received packet. As a consequence, the sender never gets its acknowledgment and retries sending the packet after a timeout. This is TCP’s expected retransmission behavior. Therefore, as with any kind of data loss on TCP/IP, it imposes lagging and random delays on the data flow.
3. Collisions
The one’s complement checksum is one of the earliest and simplest hashing schemes in widespread use. An important concept that arises when we evaluate hashing algorithms is the collision. A collision happens when two different inputs evaluate to the exact same hash code. This is especially harmful when the hash codes are used for error detection or for uniquely identifying some object representation.
In any application relying on hashing, a collision means that the system can be fooled into thinking that two different pieces of information are the same. For instance, passwords are usually stored as some sort of hash. In that context, a collision means that an attacker might gain access not only by guessing the exact password but also by finding any other byte sequence that evaluates to the same hash code. That way, brute-forcing may need somewhat fewer tries.
In any case, TCP uses a 16-bit checksum. Is that any good? In fact, the main criticism of this checksum is that very small differences between packets can produce the same hash code. For example, reordering 16-bit words, or bit flips in different words that cancel each other out in the sum, go completely undetected. More robust algorithms require packets to differ far more substantially before a collision appears.
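To see how little it takes, the sketch below reuses the internet_checksum() helper from the earlier example: swapping two 16-bit words produces a different packet with exactly the same checksum.

```python
original = bytes.fromhex("123456789abc")
swapped = bytes.fromhex("567812349abc")    # the first two 16-bit words reordered

assert original != swapped
print(hex(internet_checksum(original)))    # same value for both inputs:
print(hex(internet_checksum(swapped)))     # a collision the receiver can't detect
```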
Note that, in the case of a collision, the receiver could see anything from gibberish in the payload data to not receiving the data at all, if the corrupted bits were in the packet header. In practice, the actual risk depends on the reliability of the Internet routes the packets flow through. Higher-quality links, with low bit error rates, reduce the odds of multiple errors that cancel each other out occurring in the same packet.
4. Application-layer Workarounds
4.1. Improving Error Detection
Of course, there are applications in which we can’t afford even the slightest chance of an error going undetected. In this case, we can add more robust error detection to the application-layer messages. We can use the relatively fast and reliable Cyclic Redundancy Check – CRC32 algorithm (used in ZIP files, for instance). Furthermore, to achieve higher protection, we can use even stronger hashing algorithms, like SHA-256 (the same one used by some well-known cryptocurrencies).
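Both options are available in Python’s standard library; the sketch below computes a CRC-32 with zlib and a SHA-256 digest with hashlib over a hypothetical application message.

```python
import hashlib
import zlib

message = b"application-layer payload we want to protect"

crc = zlib.crc32(message) & 0xFFFFFFFF         # fast 32-bit CRC, same family as ZIP's
digest = hashlib.sha256(message).hexdigest()   # 256-bit cryptographic digest

print(f"CRC-32:  {crc:#010x}")
print(f"SHA-256: {digest}")
```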
The more robust the detection, the higher the overhead in latency and computing power. A good example of CRC protection at the application layer is HTTP compression (defined in RFC 7231): the gzip content coding carries its own CRC-32. If we use an application-layer protocol of our own design, it’s not difficult to add a plain CRC, CRC-protected data compression, or even a cryptographic hash code.
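As an illustration of such a home-grown protocol, the sketch below appends a CRC-32 to every message and verifies it on the receiving side; the frame layout (length, payload, CRC) is just an assumption for the example, not any standard.

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length and append a CRC-32 trailer."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("!I", len(payload)) + payload + struct.pack("!I", crc)

def unframe(data: bytes) -> bytes:
    """Check the CRC-32 trailer before handing the payload to the application."""
    (length,) = struct.unpack_from("!I", data, 0)
    payload = data[4:4 + length]
    (crc,) = struct.unpack_from("!I", data, 4 + length)
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: payload corrupted in transit")
    return payload

print(unframe(frame(b"hello")))   # b'hello'
```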
While choosing a hashing algorithm, we must decide how far we need to go to ensure the required protection. Algorithms once believed to be secure, like MD5, have long been broken: practical collision attacks exist, and password-cracking tools such as hashcat and John the Ripper can quickly brute-force weakly hashed secrets.
Another alternative is the Stream Control Transmission Protocol – SCTP (defined in RFC 4960). Designed as an alternative to TCP, it offers better fault tolerance by using a CRC32c checksum. However, its adoption has not lived up to the hype, and since it is not as commonly used, we might have trouble getting it through strictly configured, firewall-protected networks.
4.2. Improving Error Correction
We can go even further: by adding redundancy to the data, it is possible not only to detect errors but also to recover from them without retransmission, using a technique called Forward Error Correction. Since error correction codes can be designed with an arbitrary level of robustness, the more robust the code, the more redundant data and overhead we have to carry.
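As a toy illustration of the idea, the sketch below adds one XOR parity block per group of equally sized data blocks, which lets the receiver rebuild a single block it knows it has lost; real codes such as Hamming or Reed-Solomon trade more redundancy for stronger correction.

```python
def xor_blocks(blocks):
    """XOR equally sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def encode(data_blocks):
    # append one parity block: the XOR of all data blocks
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(received_blocks, lost_index):
    # XOR every surviving block (parity included) to rebuild the missing one
    survivors = [b for i, b in enumerate(received_blocks) if i != lost_index]
    return xor_blocks(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]      # hypothetical application data, split in blocks
sent = encode(data)
print(recover(sent, lost_index=1))      # b'BBBB', rebuilt without any retransmission
```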
That said, the most common packet errors on TCP will lead to packet discards by the kernel’s network subsystem, and thus force retransmissions. To avoid any retransmission whatsoever, we need a transport protocol that gives us better control over its error detection mechanism.
For instance, we could use UDP and send datagrams without its optional checksum, handling integrity entirely at the application layer. This takes some extra work because modern operating systems fill in the optional 16-bit UDP checksum field by default (and IPv6 makes it mandatory).
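On Linux, one way to do this without crafting raw packets is the SO_NO_CHECK socket option, shown in the sketch below; the option number (11) is a Linux-specific assumption since Python’s socket module doesn’t expose it as a constant, and it only applies to IPv4.

```python
import socket

SO_NO_CHECK = 11   # Linux option number from <asm-generic/socket.h>; not portable

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, SO_NO_CHECK, 1)   # ask the kernel to zero the UDP checksum
sock.sendto(b"payload protected by our own application-layer check",
            ("192.0.2.10", 9999))                    # documentation address, replace as needed
```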
5. Conclusion
In this tutorial, we’ve seen that there is, in fact, a relatively remote possibility of error detection failing on TCP. But since most application-layer protocols and file formats already add their own integrity checks, in common scenarios it is not really a concern.
Regardless, if we do need stronger guarantees or are designing our own mission-critical application, we can improve its reliability to meet virtually any requirement, provided that we can afford the associated overhead.