The transport level provides end-to-end communication between processes executing on different machines. Although the services provided by a transport protocol are similar to those provided by a data link layer protocol, there are several important differences between the transport and lower layers:
Two solutions:
One simplification is to break the problem into two parts: have transport addresses be a combination of machine address and local process on that machine.
Don't send data unless there is room. Also, the network layer/data link layer solution of simply not acknowledging frames for which the receiver has no space is unacceptable. Why? In the data link case, the line is not being used for anything else; thus retransmissions are inexpensive. At the transport level, end-to-end retransmissions are needed, which wastes resources by sending the same packet over the same links multiple times. If the receiver has no buffer space, the sender should be prevented from sending data.
For datagram-oriented protocols, opening a connection simply allocates and initializes data structures in the operating system kernel.
Connection-oriented protocols often exchange messages that negotiate options with the remote peer at the time a connection is opened. Establishing a connection may be tricky because of the possibility of old or duplicate packets.
Finally, although not as difficult as establishing a connection, terminating a connection presents subtleties too. For instance, both ends of the connection must be sure that all the data in their queues have been delivered to the remote application.
We'll look at these issues in detail as we examine TCP and UDP. We won't use much of the OSI terminology discussed by Tanenbaum.
UDP provides unreliable datagram service. It uses the raw datagram service of IP and does not use acknowledgements or retransmissions.
Need delivery to a process. The first difference between UDP and IP is that IP includes only enough information to deliver a datagram to the specified machine. Transport protocols deal with process-to-process communication. How can we specify a particular process?
Although it is convenient to think of transport service between processes, this leads to some problems:
The solution is to add a level of indirection. Transport-level addresses refer to services without regard to who actually provides that service. In most cases, a transport service maps to a single process.
TCP and UDP use ports to identify services on a machine. Conceptually, ports behave like mailboxes. Datagrams destined for a port are queued at the port until some process reads them, and each service has its own mailbox.
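For instance, a minimal sketch in Python of a process claiming a port and reading the datagrams queued there (the port number 5000 is an arbitrary choice for illustration):

    import socket

    # A process claims a UDP port by binding a socket to it; datagrams
    # addressed to this machine and port 5000 queue up until read.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 5000))        # "" means any local address

    data, (src_host, src_port) = sock.recvfrom(2048)
    print("received %d bytes from %s:%d" % (len(data), src_host, src_port))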
Like all packets we've seen, UDP datagrams consist of a UDP header and some data. The UDP header contains the following fields:
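1) Source port (16 bits)
2) Destination port (16 bits)
3) Length of the entire datagram, header plus data (16 bits)
4) Checksum (16 bits)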
The checksum field is unusual because it includes a 12-byte pseudo header that is not actually part of the UDP datagram itself. The information in the pseudo header comes from the IP datagram header:
The purpose of the pseudo header is to provide extra verification that a datagram has been delivered properly. To see why this is appropriate, recall that because UDP is a transport protocol it really deals with transport addresses. Transport addresses should uniquely specify a service regardless of what machine actually provides that service.
Note: the use of a pseudo header is a strong violation of our goal of layering. However, the decision is a compromise based on pragmatics. Using the IP address as part of the transport address greatly simplifies the problem of mapping between transport-level addresses and machine addresses.
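To make the pseudo header concrete, here is a minimal sketch of how a UDP checksum might be computed over the 12-byte pseudo header (source address, destination address, a zero byte, the protocol number 17, and the UDP length) followed by the UDP segment itself; the helper names are illustrative only:

    import socket
    import struct

    def ones_complement_sum(data):
        # Pad to an even length, then add 16-bit words with end-around carry.
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)
        return total

    def udp_checksum(src_ip, dst_ip, udp_segment):
        # udp_segment is the UDP header (checksum field zeroed) plus the data.
        # 12-byte pseudo header: source address, destination address,
        # a zero byte, the protocol number (17 = UDP), and the UDP length.
        pseudo = struct.pack("!4s4sBBH",
                             socket.inet_aton(src_ip),
                             socket.inet_aton(dst_ip),
                             0, 17, len(udp_segment))
        return ~ones_complement_sum(pseudo + udp_segment) & 0xFFFF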
How are port addresses assigned?
Note: UDP does not address the issue of flow control or congestion control. Thus, it is unsuitable for use as a general transport protocol.
If UDP datagrams are unreliable, how can we use UDP in the oracle assignment? The local Ethernet is highly reliable, but a careful client should still use timeouts and retransmit when necessary.
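A minimal sketch of such a cautious client (the server name, port, and message are hypothetical placeholders, not details of the actual assignment):

    import socket

    SERVER = ("oracle.example.edu", 6000)    # hypothetical host name and port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)                     # don't wait forever for a reply

    reply = None
    for attempt in range(5):                 # retransmit a few times, then give up
        sock.sendto(b"question", SERVER)
        try:
            reply, _ = sock.recvfrom(2048)
            break
        except socket.timeout:
            continue                         # request or reply was lost; try again

    if reply is None:
        print("no response from the server")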
TCP provides reliable, full-duplex, byte stream-oriented service. It resides directly above IP (and adjacent to UDP), and uses acknowledgments with retransmissions to achieve reliability. TCP differs from the sliding window protocols we have studied so far in the following ways:
The sending TCP module divides the byte stream into a set of packets called segments, and sends individual segments within an IP datagram.
TCP decides where segment boundaries start and end (the application does not!). In contrast, data link protocols are simply handed individual packets to transmit; they do not choose packet boundaries.
The left and right window edges are byte pointers.
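Acknowledgments returned to the sender carry two pieces of information: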
1) A conventional ACK indicating what has been received, and
2) The current receiver's window size; that is, the number of bytes of data the receiver is willing to accept.
The presence of flow control at the transport level is important because it allows a slow receiver to shut down a fast sender. For example, a PC can direct a supercomputer to stop sending additional data until it has processed the data it already has.
TCP segments consist of a TCP header followed by user data. The TCP header contains the following fields:
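1) Source and destination ports (16 bits each)
2) Sequence number (32 bits)
3) Acknowledgment number (32 bits)
4) Header length (the offset at which data begins)
5) Flag bits: URG, ACK, PSH, RST, SYN, FIN
6) Window size (the receiver's advertised window)
7) Checksum (computed over a pseudo header, the TCP header, and the data, just as in UDP)
8) Urgent pointer
9) Options (e.g., maximum segment size)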
Why use 32-bit sequence numbers? Transport protocols must always consider the possibility of delayed datagrams arriving unexpectedly.
Consider the following:
1) A retransmits every segment 3 times. (Perhaps our retransmit timers are wrong.)
2) Packets aren't being lost, just delayed.
Ensuring that our sequence number space is large enough to detect old (invalid) datagrams depends on two factors:
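1) The maximum time a datagram can survive in the network before being discarded, and
2) The rate at which we send data, and hence consume sequence numbers.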
One interesting option is the maximum segment size option, which allows the sender and receiver to agree on how large segments can be. This allows a small machine with few resources to prevent a large machine from sending segments that are too large for the small machine to handle. On the other hand, larger segments are more efficient, so they should be used when appropriate.
Note: A TCP segment does not have to contain any data.
The urgent pointer is something we have not encountered before. It allows the sending application to indicate the presence of high-priority data that should be processed ASAP (e.g., drop what you are doing and read all the input).
One example use is when a telnet user types CTRL-C to abort the current process. Most likely, the user would like to get a prompt right away and does not want to see any more output from the aborted job. Unfortunately, there may be thousands of bytes of data already queued in the connection between the remote process and the local terminal.
The local shell places the CTRL-C in the input stream and tells the remote telnet that urgent data is present. The remote telnet sees the urgent data, then quickly reads the data, and when it sees the CTRL-C, throws away all the input and kills the running job. The same type of action then takes place so that the remote telnet can signal the local telnet to throw away any data that is in the pipeline.
Upon receipt of a RST segment, TCP aborts the connection and informs the application of the error.
The PSH bit is requested by the sending application; it is not generated by TCP itself. Why is it needed?
(See Tanenbaum, p. 395.)
TCP uses a 3-way handshake to initiate a connection. The handshake serves two functions:
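1) It ensures that both sides are ready to transfer data (and that each knows the other is ready), and
2) It allows both sides to agree on initial sequence numbers.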
When opening a new connection, why not simply use an initial sequence number of 0? Because if connections are of short duration, exchanging only a small number of segments, we may reuse low sequence numbers too quickly. Thus, each side that wants to send data must be able to choose its initial sequence number.
The 3-way handshake proceeds as follows:
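1) A sends B a SYN segment announcing its initial sequence number: SEQ=A_SEQ.
2) B responds with a segment that both acknowledges A's SYN and announces B's own initial sequence number: SEQ=B_SEQ, ACK=(A_SEQ+1).
3) A acknowledges B's SYN: SEQ=(A_SEQ+1), ACK=(B_SEQ+1).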
Note: The sequence numbers used in SYN segments are actually part of the sequence number space. That is why the third segment that A sends contains SEQ=(A_SEQ+1). This is required so that we don't get confused by old SYNs that we have already seen.
To ensure that old segments are ignored, TCP ignores any segments that refer to a sequence number outside of its receive window. This includes segments with the SYN bit set.
TCP sets the FIN bit when the sending application has no more data to send. On receipt of a FIN segment, TCP refuses to accept any more new data (data whose sequence number is greater than that indicated by the FIN segment).
Closing a connection is further complicated because receipt of a FIN doesn't mean that we are done. In particular, we may not have received all the data leading up to the FIN (e.g., some segments may have been lost), and we must make sure that we have received all the data in the window.
Also, FINs refer to only 1/2 of the connection. If we send a FIN, we cannot send any more new data, but we must continue accepting data sent by the peer. The connection closes only after both sides have sent FIN segments.
Finally, even after we have sent and received a FIN, we are not completely done! We must wait around long enough to be sure that our peer has received an ACK for its FIN. If it has not, and we terminate the connection (deleting a record of its existence), we will return a RST segment when the peer retransmits the FIN, and the peer will abort the connection.
Consider the two-army problem: the two divisions of one army must coordinate a simultaneous attack on the other, but they can only communicate through unreliable messengers. There is no protocol for correctly coordinating the attack: is the last messenger necessary? Yes, and by induction so is every messenger before it.
There is no satisfactory solution to the problem, which is analogous to closing a connection; the best we can do is get it right most of the time.
Transport protocols operating across connectionless networks must implement congestion control. Otherwise, congestion collapse may occur. Congestion collapse occurs when the network is so overloaded that it is only forwarding retransmissions, and most of them are delivered only part way before being discarded. Congestion control refers to reducing the offered load on the network when it becomes congested.
What factors govern the rate at which TCP sends segments?
What value should TCP use for a retransmission timer?
To cope with widely varying delays, TCP maintains a dynamic estimate of the current RTT, called the smoothed round-trip time (SRTT). Each time a new measurement M of the RTT is taken, the estimate is updated as:

    SRTT = alpha * SRTT + (1 - alpha) * M

The factor alpha is known as a smoothing factor, and it determines how much weight the new measurement carries. When alpha is 0, we simply use the new value; when alpha is 1, we ignore the new value.
Typical values for alpha lie between .8 and .9.
Because the actual RTT naturally varies between successive transmissions due to normal queuing delays, it would be a mistake to throw out the old one and use the new one. Use of the above formula causes us to change our SRTT estimate slowly, so that we don't overreact to wild fluctuations in the RTT.
Because the SRTT is only an estimate of the actual delay, and actual delays vary from packet to packet, set the actual retransmission timeout (RTO) for a segment to be somewhat longer than SRTT. How much longer?
TCP also maintains an estimate of the mean deviation (MDEV) of the RTT. MDEV is the difference between the measured and expected RTT and provides a close approximation to the standard deviation. It is smoothed in the same manner as SRTT:

    MDEV = alpha * MDEV + (1 - alpha) * |SRTT - M|

Finally, when transmitting a segment, set its retransmission timer to RTO:

    RTO = SRTT + k * MDEV

The multiplier k was originally proposed as 2, but further experience has shown 4 to be better.
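A minimal sketch of this estimator, with alpha = 0.875 chosen from the range above and the class and variable names purely illustrative:

    class RttEstimator:
        """Keeps a smoothed RTT and mean deviation, and derives an RTO."""

        def __init__(self, alpha=0.875, k=4, initial_rtt=1.0):
            self.alpha = alpha        # smoothing factor for SRTT and MDEV
            self.k = k                # multiplier for MDEV (2 originally, 4 today)
            self.srtt = initial_rtt   # smoothed round-trip time (seconds)
            self.mdev = 0.0           # smoothed mean deviation (seconds)

        def update(self, measured_rtt):
            # Fold the new measurement into the running estimates.
            self.mdev = self.alpha * self.mdev + \
                        (1 - self.alpha) * abs(self.srtt - measured_rtt)
            self.srtt = self.alpha * self.srtt + (1 - self.alpha) * measured_rtt

        def rto(self):
            # Retransmission timeout: somewhat longer than the expected RTT.
            return self.srtt + self.k * self.mdev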
Early versions of TCP (4.2 and 4.3 BSD) used a much simpler retransmission algorithm that resulted in excessive retransmissions under some circumstances. Indeed, improper retransmission timers led to excessive retransmissions which contributed to congestion collapse.
The second congestion control mechanism in TCP adjusts the size of the sending window to match the current ability of the network to deliver segments:
TCP uses a congestion window to keep track of the appropriate send window relative to network load. The congestion window is not related to the flow-control window, as the two windows address orthogonal issues. Of course, the actual send window in use at any one time will be the smaller of the two windows.
There are two parts to TCP's congestion control mechanism:
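1) Slowly increasing the size of the send window to take advantage of additional network capacity as it becomes available, and
2) Decreasing the send window quickly when the network shows signs of congestion.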
The first case is also referred to as congestion avoidance and is handled by slowly, but continually, increasing the size of the send window. We want to take advantage of available resources slowly, but not so fast that we overload the network. In particular, we want to increase so slowly that we will get feedback from the network or the remote end of the connection before we've increased the level of congestion significantly.
The second case is known as congestion control and is handled by decreasing the window suddenly and significantly, reacting after the network becomes overloaded.
To see how things work, let us assume that TCP is in the congestion avoidance phase, sending data at just the right rate for current conditions.
To make use of any additional capacity that becomes available, the sender slowly increases the size of its send window. When can the sender safely increase its send window size?
As long as it receives a positive indication that the data it is transmitting is reaching the remote end, none of the data is getting lost, so there must not be much (if any) congestion. Specifically, TCP maintains a variable cwnd that specifies the current size of the congestion window (in segments). When TCP receives an acknowledgment that advances the send window, increase cwnd by 1/cwnd.
This linear increase enlarges the congestion window by one segment every round trip time. (The increase is linear in real time because the window grows by a constant amount every round trip time.)
Because the send window continually increases, the network will eventually become congested. How can TCP detect congestion? When it fails to receive an ACK for a segment it just sent.
When the sender detects congestion, it
halves the current size of the congestion window,
saves it in a temporary variable ssthresh, and
sets cwnd to 1.
At this point, slow start takes over.
During slow start, the sender increases cwnd by one on every new ACK.
In effect, the sender increases the size of the window exponentially, doubling the window size every round trip time.
Once cwnd reaches ssthresh, congestion avoidance takes over and the window resumes its linear increase.
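A minimal sketch of these window adjustments (cwnd and ssthresh are in segments; the initial ssthresh is an arbitrary illustrative value, and transmission and timer details are omitted):

    class CongestionWindow:
        """Tracks the congestion window as described above (units: segments)."""

        def __init__(self):
            self.cwnd = 1.0        # congestion window
            self.ssthresh = 64.0   # slow-start threshold (arbitrary starting value)

        def send_window(self, advertised):
            # The window actually used is the smaller of the congestion
            # window and the receiver's advertised (flow-control) window.
            return min(self.cwnd, advertised)

        def on_new_ack(self):
            if self.cwnd < self.ssthresh:
                # Slow start: one more segment per ACK; doubles every RTT.
                self.cwnd += 1.0
            else:
                # Congestion avoidance: grows by about one segment per RTT.
                self.cwnd += 1.0 / self.cwnd

        def on_retransmit_timeout(self):
            # A lost segment is taken as a sign of congestion: remember half
            # the current window in ssthresh, then restart slow start.
            self.ssthresh = self.cwnd / 2.0
            self.cwnd = 1.0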
Slow start has several important properties:
Slow start guarantees that a sender will never transmit more than two back-to-back packets.
Finally, how does TCP detect the presence of congestion? Because source quench messages are unreliable, TCP assumes that all lost packets result from congestion. Thus, a retransmission event triggers the slow start phase of the algorithm.
Lower-level protocol layers use compact 32-bit Internet addresses. In contrast, users prefer meaningful names to denote objects (e.g., eve). Using high-level names requires an efficient mechanism for mapping between high-level names and low-level addresses.
Originally, the Internet was small and mapping between names and addresses was accomplished using a centrally-maintained file called hosts.txt. To add a name or change an address required contacting the central administrator, updating the table, and distributing it to all the other sites. This solution worked at first because most sites had only a few machines, and the table didn't require frequent changes. The centrally-maintained table suffered from several drawbacks:
The Domain Name System (DNS) is a hierarchical, distributed naming system designed to cope with the problem of explosive growth:
DNS queries are handled by servers called name servers.
In the DNS, the name space is structured as a tree, with domain names referring to nodes in the tree. The tree has a root, and a fully-qualified domain name is identified by the components of the path from the domain name to the root.
In the figure, cs.purdue.edu, garden.wpi.edu, and decwrl.dec.com are fully-qualified domain names.
The top level includes several subdomains, including (among others):
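edu (educational institutions), com (commercial organizations), gov (government agencies), mil (the military), org (other organizations), net (network providers), and the two-letter country domains (e.g., us, uk).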
The DNS links data objects called resource records (RRs) to domain names. RRs contain information such as internet addresses or pointers to name servers.
Resource records consist of five parts:
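1) The domain name to which the record applies (the owner)
2) A type (e.g., A for an address, NS for a name server, MX for a mail exchanger, PTR for a pointer)
3) A class (IN for Internet data)
4) A time-to-live (TTL), which controls caching (see below)
5) The data itself (e.g., an IP address)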
The following table gives a set of sample RRs in use at WPI. The name servers are downstage, circa, rata, and tahi. This information can be obtained using the Unix program nslookup.
Note: The mail address cew@cs.wpi.edu is valid, even though there is no machine called cs.wpi.edu. Also note that mail to garden and wpi go to bigboote.
How does the SMTP mailer decide which machine mail addressed to XXX@cs.wpi.edu should be sent to?
Name servers are the programs that actually manage the name space. The name space is divided into zones of authority, and a name server is said to be authoritative for all domain names within its zone.
Name servers can delegate responsibility for a subdomain to another name server, allowing a large name space to be divided into several smaller ones.
At Purdue, for instance, the name space purdue.edu is divided into three subdomains: cs, cc, and ecn.
Name servers are linked by pointers. When a name server delegates authority for a subdomain, it maintains pointers to the name servers that manage the subdomain. Thus, the DNS can resolve fully-qualified names by starting at the root and following pointers until reaching an authoritative name server for the name being looked up. (These delegation pointers are stored in NS resource records.)
Note: The shape of the name space and the delegation of subdomains does not depend on the underlying topology of the Internet.
When a client (application) has a name to translate, it sends a DNS query to a name server. DNS queries (and responses) are carried within UDP datagrams. There are two types of queries:
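1) Recursive queries, in which the name server is asked to resolve the name completely, querying other servers on the client's behalf if necessary, and
2) Iterative queries, in which the server returns either the answer or a referral to the next name server to ask.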
DNS Queries (and responses) consist of four parts:
Conceptually, any application that accesses information managed by the DNS must query the DNS. In practice, DNS queries are hidden in library routines that a user simply calls without having to worry about how they are implemented. In Unix, for example, the routine gethostbyname(3) finds the IP address of a host. Although gethostbyname interacts with name servers, it behaves like a regular procedure call.
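For instance, a minimal sketch using Python's socket module, which wraps the same resolver library (the host name comes from the figure above and may no longer exist):

    import socket

    # Looks like an ordinary procedure call, but underneath the resolver
    # sends DNS queries to a name server and waits for the response.
    address = socket.gethostbyname("garden.wpi.edu")
    print(address)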
The DNS also addresses two important issues:
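1) Performance: clients should not need to query a distant server every time a frequently-used name is looked up (caching).
2) Reliability: the failure of a single name server should not make part of the name space unresolvable (replication).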
Note: The owner of an RR manages the caching behavior for its names. Each RR includes a TTL field that specifies how long a name may be cached. For names that don't change often, long timeouts (e.g., several days) are used.
Typically, one name server is designated the master, with the remaining servers designated slaves. The master/slave machines run a special protocol so that slave servers obtain new copies of the database whenever it changes. However, clients may query either masters or slaves.
The top level of the name space includes the domain in-addr.arpa, and machine addresses are reversed and converted into the form: 1.24.215.130.in-addr.arpa, which can be translated by the DNS.
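A minimal sketch of that transformation (Python's socket.gethostbyaddr performs the same steps, plus the actual PTR query, internally):

    def reverse_name(ip):
        # "130.215.24.1" -> "1.24.215.130.in-addr.arpa"
        return ".".join(reversed(ip.split("."))) + ".in-addr.arpa"

    print(reverse_name("130.215.24.1"))   # prints 1.24.215.130.in-addr.arpa
    # The call socket.gethostbyaddr("130.215.24.1") performs this
    # transformation, issues the PTR query, and returns the host's name.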