Sunday, August 30, 2009

Xilinx TEMAC Checksum offload programming sample

Programming the Xilinx TEMAC checksum offload engine is a bit complex, since it calculates only the TCP and UDP checksums on transmission and gives only the raw checksum on reception. So, I have tried to explain the logic behind TCP/UDP checksum offload in hardware with a sample UDP packet transmission and reception.

For Transmission

Step 1:

In the UDP layer, do not forget to fill the checksum field of the UDP header with zero.
Calculate the pseudo header checksum and send it to the driver.

How to calculate pseudo_csum?

unsigned int pseudo_csum;
unsigned short *iphdr_ptr; /* Must point at the Source IP Address field (byte offset 12 of the IPv4 header) */

pseudo_csum = *iphdr_ptr++; /* Source IP Address (First two bytes) */
pseudo_csum += *iphdr_ptr++; /* Source IP Address (Last two bytes) */
pseudo_csum += *iphdr_ptr++; /* Destination IP Address (First two bytes) */
pseudo_csum += *iphdr_ptr++; /* Destination IP Address (Last two bytes) */
pseudo_csum += htons(UDP_PROTOCOL_ID); /* UDP Protocol ID: 0x11 */
pseudo_csum += udp_length; /* UDP Packet Length (Data + Header Length), in network byte order like the words above */
pseudo_csum = (pseudo_csum & 0xffff) + (pseudo_csum >> 16); /* Add the carry */
pseudo_csum += (pseudo_csum >> 16); /* Add the carry generated by the previous addition, if any */

return (unsigned short)pseudo_csum;

Step 2:

Set the Transmit buffer descriptor with these extra settings and transmit.

TransmitBD.APP0 = TransmitBD.APP0 | TX_CSCNTRL;
TransmitBD.APP1 = (TX_CSBEGIN << 16) | TX_CSINSERT;

TransmitBD.APP2 = pseudo_csum;

TX_CSCNTRL is 0x01.
TX_CSBEGIN is 34 for IPv4/UDP. It is the start of the UDP header (Ethernet header size (14) + IP header size (20)).
TX_CSINSERT is 40 for IPv4/UDP. It is the offset of the checksum field (start of the UDP header (34) + checksum offset (6)).
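
The two offsets can also be derived from the packet instead of being hard-coded. The following is only a sketch: `struct tx_bd_app` and `tx_bd_set_udp_csum` are hypothetical names standing in for the three APP words used above, and whether the engine copes with IP headers that carry options is not something this post confirms.

```c
#include <stdint.h>

#define ETH_HDR_LEN     14u    /* untagged Ethernet header */
#define UDP_CSUM_OFFSET 6u     /* checksum field offset inside the UDP header */
#define TX_CSCNTRL      0x01u  /* value given in the text */

/* Hypothetical stand-in for the APP words of the transmit descriptor. */
struct tx_bd_app {
    uint32_t app0;
    uint32_t app1;
    uint32_t app2;
};

/* Fill the checksum offload words for an IPv4/UDP frame.
 * ip_hdr points at the first byte of the IPv4 header. */
void tx_bd_set_udp_csum(struct tx_bd_app *bd, const uint8_t *ip_hdr,
                        uint16_t pseudo_csum)
{
    /* The low nibble of the first byte is the header length in 32-bit words. */
    uint32_t ip_hdr_len = (uint32_t)(ip_hdr[0] & 0x0fu) * 4u;
    uint32_t cs_begin   = ETH_HDR_LEN + ip_hdr_len;      /* 34 without options */
    uint32_t cs_insert  = cs_begin + UDP_CSUM_OFFSET;    /* 40 without options */

    bd->app0 |= TX_CSCNTRL;
    bd->app1  = (cs_begin << 16) | cs_insert;
    bd->app2  = pseudo_csum;
}
```

For a plain 20-byte IP header this produces the same 34 and 40 as the constants above.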

For Reception

On reception, the hardware checksum offload engine just adds up the IP packet data as 16-bit words. So, subtract the IP header, subtract the UDP checksum, calculate and add the pseudo header, and verify the result against the checksum in the packet.

unsigned short *ip_data = (unsigned short *)(RecieveBD.Buffer + 14);
struct udp_header *udp_hdr = (struct udp_header *)(RecieveBD.Buffer + 14 + 20);
unsigned short packet_csum, hw_csum;
unsigned int temp;
int i;


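Take the hardware generated checksum and shift it 16 bits left. The shift does not change the value modulo 0xffff, because the final fold adds the upper half back to the lower half, but it gives the following subtractions an upper half to borrow from.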

temp = RecieveBD.APP3 & 0xffff;
temp = temp << 16;


Subtract the first 12 bytes of the IP header (everything except the Source and Destination IP addresses, which belong to the pseudo header).

for (i = 0; i < 6; i++) {
    temp -= ip_data[i];
}


Add the UDP protocol ID (0x11) (pseudo header).

temp += htons(UDP_PROTOCOL_ID);

Subtract the UDP checksum, and keep it for later validation.

packet_csum = udp_hdr->check_sum;
temp -= packet_csum;


Add the UDP packet length (pseudo header).

temp += udp_hdr->length;
temp = (temp & 0xffff) + (temp >> 16);
temp += (temp >>16);


Compare the resulting checksum (16 bits)

hw_csum = (unsigned short)temp;
if (hw_csum != 0xffff)
    hw_csum = ~hw_csum;

if (hw_csum == packet_csum)
    ; /* Checksum passed */
else
    ; /* Checksum failed */
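
For comparison, the same check can be sketched without host byte order concerns by reading every word byte by byte. This is a variation on the steps above, not the exact code: it replaces each subtraction with the addition of the complement (which is how subtraction works in one's-complement arithmetic) and then expects the folded total to be 0xffff. It assumes the raw value is the folded 16-bit sum of everything after the 14-byte Ethernet header, as described above, and a 20-byte IP header. `raw_csum_model` is only a software stand-in for the hardware value, and all function names are made up for this sketch. A packet whose UDP checksum field is zero carries no checksum and would have to skip this check.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14u
#define IP_HDR_LEN      20u    /* assumes no IP options */
#define UDP_PROTOCOL_ID 0x11u

/* Read one 16-bit word in network (big-endian) order. */
uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Fold a 32-bit accumulator to 16 bits, adding every carry back in. */
uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Software model of the raw value: sum of all words after the Ethernet header. */
uint16_t raw_csum_model(const uint8_t *frame, size_t frame_len)
{
    uint32_t sum = 0;
    size_t i = ETH_HDR_LEN;

    for (; i + 1 < frame_len; i += 2)
        sum += rd16(frame + i);
    if (i < frame_len)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)frame[i] << 8;
    return fold16(sum);
}

/* Returns 1 if the UDP checksum is consistent with the raw sum, else 0. */
int udp_rx_csum_ok(const uint8_t *frame, uint16_t raw_csum)
{
    const uint8_t *ip  = frame + ETH_HDR_LEN;
    const uint8_t *udp = ip + IP_HDR_LEN;
    uint32_t sum = raw_csum;
    size_t i;

    /* Remove the first 12 bytes of the IP header: adding ~w subtracts w. */
    for (i = 0; i < 12; i += 2)
        sum += (uint16_t)~rd16(ip + i);

    sum += UDP_PROTOCOL_ID;                  /* pseudo header: protocol */
    sum += rd16(udp + 4);                    /* pseudo header: UDP length */

    /* The addresses and the UDP checksum are already part of the raw sum. */
    return fold16(sum) == 0xffffu;
}
```

Because the UDP checksum stays inside the sum, there is no need to set it aside and compare it afterwards.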


Send the result to the upper layer. Just follow the same algorithm for TCP and IPv6 too. For an optimized solution and pseudo code for checksum verification, read the following post:


http://embeddedknowledge.blogspot.com/2011/07/xilinx-temac-checksum-offload.html

If you have any queries, write them as comments.

Friday, August 28, 2009

Japan Robots

Welcoming Guests and distributing tissue papers.

Walking and shaking hands
Arranging the world famous Japanese dish 'Sushi'.

" I am ready to prepare Okonomiyaki (Japanese dish)"

Distributing the food. (I wonder how it balances to carry the weight and roll around)

"Hey! Here is the show.."

Don't add him to the robot list. But his T-shirt was interesting.


Thursday, August 27, 2009

Checksum error: differs by 1 or 2

When you calculate a checksum and the result differs from the correct value only in the last digits (the value is off by 0x01, 0x02 or 0x03), the problem is that the carry has not been added.
For example, the correct checksum is 0x5e94, but your calculation gives 0x5e96. To solve this problem, check whether the carry has been added.

csum += *(unsigned short *)data;
:
csum = (csum & 0xffff) + (csum >> 16); /* Please add this carry too */
csum = csum + (csum >> 16); /* If the carry is again generated, it has to be added */
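
To see the effect on real numbers, here is a small sketch that computes the same checksum with and without the carry. The function names are made up for this example, and the words are treated as plain host-order values to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

/* Wrong: truncating to 16 bits throws the carries away. */
uint16_t csum16_no_carry(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~sum;
}

/* Right: add the carries back into the low 16 bits before complementing. */
uint16_t csum16(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += words[i];
    sum = (sum & 0xffffu) + (sum >> 16);   /* add the carry */
    sum += sum >> 16;                      /* add the carry of the carry */
    return (uint16_t)~sum;
}
```

Take the nine words 0x4500, 0x0073, 0x0000, 0x4000, 0x4011, 0xc0a8, 0x0001, 0xc0a8, 0x00c7 (an IPv4 header without its checksum field). Their sum is 0x2479c, so two carries are pending: `csum16` returns 0xb861, while `csum16_no_carry` returns 0xb863, which is off by exactly 2.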

Wednesday, August 26, 2009

Software design and implementation for TCP/UDP/IP Checksum offloading Interface

        Most modern NICs (Network Interface Cards) have gigabit speed capability and come with TCP/UDP/IP checksum offloading support. Embedded operating systems can no longer postpone the integration of checksum offloading support into their drivers and protocol stacks.

        Looking at a few controllers such as Xilinx TEMAC and PowerPC TSEC, it is clear that the extent of support provided by each varies widely and is not standardized. In turn, this complicates the design and implementation of a standardized software interface that can support all kinds of controllers.

Varying features of checksum offloading

| Feature | How controllers differ |
|---|---|
| Supported layers (TCP/UDP/IP) | Some support only TCP/UDP. Some support IP too. |
| Support for packets with an options header | Some controllers do not calculate the checksum for packets with an options header. |
| Checksum calculation for the UDP/TCP pseudo header | Most controllers which do not support the IP layer and the options header expect the protocol stack to calculate the UDP/TCP pseudo header checksum and seed them with it. |
| Driver interface specification | The driver interface, such as the input parameters and output format of the checksum offload engine, varies, though it is mostly built on buffer descriptors. |
| Support for fragmented packets | Some support fragmented packets and some do not. |
| Error packet handling | Whether the hardware rejects erroneous packets or leaves them to the software also varies. |
| VLAN packet support | Some support VLAN packets and some do not. |


        From the above table, it is clear that the protocol stack cannot fully depend on the hardware engine for checksum calculation. Packets that are not supported by the checksum offload engine have to go through the software checksum calculation.

        In this article, I try to design a standardized software interface for the checksum offloading functionality. The implementation is divided into three modules called Configuration, Outbound flow and Inbound flow.
  • Configuration
        It is about advertising the abilities of the controller's Checksum Offload Engine across the protocol stack. (Let's call it COE. I hesitate to call it TOE (TCP offload engine) since it offloads UDP checksum calculation too.) In other words, it is about initializing the network interface structures with the capabilities of the COE so that each protocol layer can check whether the COE supports that particular layer and process the packets accordingly. The main capabilities to be advertised are: Which layers are supported (UDP/TCP/IP)? What is the extent of support (partial or full, where partial means that the checksum offload controller does not calculate the pseudo header checksum)? Does it support fragments or not?
Interface configuration
COE supports IP?
COE supports TCP?
COE supports UDP?
COE support is full or partial?
COE supports fragmented packets?


        How this information is maintained in the network protocol stack is implementation specific. However, this article suggests storing the information as flags in the network interface structure, which the driver can initialize. OK! How can this information be used in the protocol stack? While sending and receiving each packet, each layer refers to the above flags to learn the abilities of the COE and processes the packet accordingly. So, the TCP/UDP layer must be capable of three types of processing: 1) Software 2) Partial 3) Full. And the IP layer must be capable of two types: 1) Software 2) Full. Each type is explained below.

        Software: If the COE does not support the particular layer or the packet type (for example, fragmented packets), the checksum will be calculated as usual by the software routine.

        Partial: This is a special case, mainly for the TCP and UDP layers. Some COEs support TCP and UDP checksum calculation, but they demand that the protocol stack calculate the pseudo header checksum and feed it to them (What a pity!). In this case, the TCP and UDP layers need to calculate only the pseudo header checksum and send it to the driver.

        Full: The COE calculates the whole checksum for a particular layer. The software routine does not need to do anything.
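
As a rough sketch of the idea, the flags and the choice between the three processing types could look like the following in C. Every name here (`coe_caps`, `csum_mode`, `coe_l4_mode`, `coe_ip_mode`) is hypothetical; a real stack would hang the flags off its own network interface structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability flags, set once by the driver at initialization. */
enum coe_caps {
    COE_CAP_IP       = 1 << 0,
    COE_CAP_TCP      = 1 << 1,
    COE_CAP_UDP      = 1 << 2,
    COE_CAP_FULL     = 1 << 3,  /* clear: stack must supply the pseudo header checksum */
    COE_CAP_FRAGMENT = 1 << 4
};

enum csum_mode { CSUM_SOFTWARE, CSUM_PARTIAL, CSUM_FULL };

/* Processing type for a TCP or UDP packet; l4_cap is COE_CAP_TCP or COE_CAP_UDP. */
enum csum_mode coe_l4_mode(uint32_t caps, uint32_t l4_cap, bool fragmented)
{
    if (!(caps & l4_cap))
        return CSUM_SOFTWARE;               /* layer not supported */
    if (fragmented && !(caps & COE_CAP_FRAGMENT))
        return CSUM_SOFTWARE;               /* packet type not supported */
    return (caps & COE_CAP_FULL) ? CSUM_FULL : CSUM_PARTIAL;
}

/* Processing type for the IP header checksum: software or full only. */
enum csum_mode coe_ip_mode(uint32_t caps)
{
    return (caps & COE_CAP_IP) ? CSUM_FULL : CSUM_SOFTWARE;
}
```

A controller that behaves like the TEMAC described in my later post would advertise `COE_CAP_TCP | COE_CAP_UDP` only, so its TCP/UDP layers take the partial path and its IP layer stays in software.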

        Done ! Now, COE abilities are maintained in the protocol stack. Let's see how to process each packet.
  • Outbound flow and parameters
        Generalizing the input parameters for the controllers yields the following table of input parameters that need to be passed to the driver with each packet.

Outbound parameters

| Parameter | Type |
|---|---|
| Layer3 type = IPv4 or IPv6? | flag |
| Layer4 type = TCP or UDP? | flag |
| Layer3 calculation by COE? | flag |
| Layer4 calculation by COE? | flag |
| Should calculate Pseudo header? | flag |
| Byte offset for layer4 start | offset |
| Checksum offset for layer4 | offset |
| Checksum offset for layer3 | offset |
| Pseudo Header Checksum | data |


        But, some of the parameters can be easily calculated/fixed by the driver rather than sending all the way with the packet. If that optimization is done, the list becomes as follows:

Optimized outbound parameters

| Parameter | Type |
|---|---|
| Layer3 type = IPv4 or IPv6? | flag |
| Layer4 type = TCP or UDP? | flag |
| Layer3 calculation by COE? | flag |
| Layer4 calculation by COE? | flag |
| Pseudo Header Checksum | data |


        Now, the UDP and IP checksum calculation logic in the protocol stack becomes as follows. All the above parameters are passed to the driver, and the driver sets up the COE using these parameters. That is all. The packet will come out of the controller with the calculated checksum.

UDP
Set UDP Checksum field to 0.
IF (COE Supports UDP? is Yes)
{
    IF (this is fragmented packet AND COE supports fragmented packet? is False)
    {
        Calculate by software
    }
    ELSE
    {
        /* Just send the packet. Let COE calculate the checksum */
        Set Layer4 type = UDP;
        Set Layer4 calculation by COE? = Yes;
        IF (COE support is partial)
        {
            Pseudo header checksum = Calculate just pseudo checksum
        }
    }
}
ELSE
    Calculate by software


IP
IF (COE Supports IP? is No)
{
    Calculate by software
}
ELSE
{
    Set Layer3 type = IP;
    Set Layer3 calculation by COE? = Yes;
}
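
The outbound logic could be sketched in C roughly as below. The flag names, `struct tx_csum_params` and both functions are hypothetical; the struct simply mirrors the optimized parameter table, and the caller is assumed to have the pseudo header checksum at hand for the partial case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability flags, set once by the driver at initialization. */
enum coe_caps {
    COE_CAP_IP       = 1 << 0,
    COE_CAP_TCP      = 1 << 1,
    COE_CAP_UDP      = 1 << 2,
    COE_CAP_FULL     = 1 << 3,  /* clear: stack must supply the pseudo header checksum */
    COE_CAP_FRAGMENT = 1 << 4
};

/* Hypothetical per-packet parameters handed to the driver. */
struct tx_csum_params {
    bool     l3_is_ipv6;   /* Layer3 type */
    bool     l4_is_udp;    /* Layer4 type, false means TCP */
    bool     l3_by_coe;    /* Layer3 calculation by COE? */
    bool     l4_by_coe;    /* Layer4 calculation by COE? */
    uint16_t pseudo_csum;  /* used only when COE support is partial */
};

/* UDP layer. Returns false when the stack must calculate the checksum itself. */
bool udp_tx_offload(uint32_t caps, bool fragmented, uint16_t pseudo_csum,
                    struct tx_csum_params *p)
{
    if (!(caps & COE_CAP_UDP))
        return false;
    if (fragmented && !(caps & COE_CAP_FRAGMENT))
        return false;

    p->l4_is_udp = true;
    p->l4_by_coe = true;
    if (!(caps & COE_CAP_FULL))
        p->pseudo_csum = pseudo_csum;       /* partial: seed the engine */
    return true;
}

/* IPv4 layer. Returns false when the stack must calculate the checksum itself. */
bool ip_tx_offload(uint32_t caps, struct tx_csum_params *p)
{
    if (!(caps & COE_CAP_IP))
        return false;

    p->l3_is_ipv6 = false;
    p->l3_by_coe  = true;
    return true;
}
```

When either function returns false, the layer falls back to its usual software routine and leaves the corresponding "by COE" flag cleared, so the driver does not touch that checksum.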

The TCP logic will be the same as for UDP. How to pass all this information to the driver is implementation specific. However, it can be passed along with the packet structure as flags and data bytes, as specified in the above table.
  • Inbound flow and parameters
        Some controllers give the calculated checksum and some report whether the checksum verification passed or failed. And some controllers drop the erroneous packets. So, as a general rule, this article suggests verifying the checksum at the driver level and simply dropping packets with checksum errors. No parameters are passed from the driver to the upper layer. So, only two types of packets are sent to the upper layer: 1) packets whose checksum has been verified as correct 2) packets unsupported by the COE (for example, fragmented packets). So, the UDP and TCP layers verify in software the checksum of the fragmented packets that the COE does not handle. The logic becomes as follows:

IP
IF (COE Supports IP? is No)
{
    Verify by software
}

/* All received are correct packets, when COE is enabled */


UDP
IF (COE Supports UDP? is No)
{
    Verify by software
}
ELSE IF (packet is fragmented AND COE supports fragmented packet? is False)
{
    Verify by software
}

The algorithm for TCP will be the same just as the UDP. That is all. Big job Done!!

(Please leave your comments on this article. That will help me to improve this. See you!)

Wednesday, August 19, 2009

Love letter

An embedded programmer wrote a love letter: "Dear! Since the day you were powered on, I have been loving you. Your eyes used to glitter like LEDs. But why do you emit heat whenever I approach you? Don't worry, I will be a heat sink forever. I always want to hang on your shoulders. Come on! Embed me in your heart. Tell me whatever problem you have, we can debug it. Expecting your positive reply."

He never got a reply, because she had made a reckless run with another guy.

How to


One day an embedded engineer committed suicide. There was no clue why he did so. In his office, a new programmer took over his job and looked over his code. There was a comment in his code.
/* Junk board! You always hang here. Wanna show you really how to hang */

Saturday, August 15, 2009

Embedded systems Coding: Issues and techniques - Part 1

In embedded systems, you may find that although your code is logically correct, it does not work on the board as you expect. Can you imagine all the things that may cause such problems in embedded systems? Are you aware of the coding issues related to compiler optimization, I/O synchronization, cache write-back, edge-triggered interrupts, etc.? Let me guide you on how to code around some of these issues.

Optimization is unavoidable in embedded systems to improve the speed and reduce the size of the code. But the compiler keeps a strict eye on your code and may generate trickier code which produces unexpected results. Look at the following typical case:

What do you intend to do in the above code? It is a polling loop, where each iteration should read the status register and check for a status update by the hardware. But the compiler reasons in its smarter way: why read the SAME memory location each time when its content is going to be the SAME? The compiler does not know that the content is going to be changed by the hardware after some time. So, it is daring enough to generate assembly code which is equivalent to

which may result in an infinite loop if the IO_COMPLETED flag is not set when the code reads the status register for the first time.
So, what do you have to do? You have to understand that I/O-mapped memory is different from real memory and that its content is subject to change internally by the hardware without any CPU write. So, you have to teach the compiler to treat it separately.

Declare the I/O memory locations as volatile so that the compiler will not optimize away the read/write operations on them. Then the above code will work correctly even with optimization.
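
A minimal sketch of such a polling loop with the register declared volatile is given below. The bit value of `IO_COMPLETED`, the function name and the poll limit are assumptions made for this example; the real register address and bit layout are board specific.

```c
#include <stdint.h>

#define IO_COMPLETED 0x1u   /* hypothetical bit position in the status register */

/* Poll a memory-mapped status register until IO_COMPLETED is set.
 * Returns 0 on completion, -1 if the flag is not seen within max_polls reads.
 * The poll limit is an extra safety net, not part of the technique itself. */
int wait_io_completed(volatile uint32_t *status_reg, unsigned long max_polls)
{
    while (max_polls--) {
        /* volatile forces a fresh read of the register on every iteration;
         * without it the compiler may read once and reuse the value. */
        if (*status_reg & IO_COMPLETED)
            return 0;
    }
    return -1;
}
```

The caller would pass the register address cast to `volatile uint32_t *`. Because the pointer target is volatile, the compiler has to keep the load inside the loop even at high optimization levels.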