Monday, February 15, 2010

PPC405GP MMU: Problem in switching between Real and Translation modes

I had my BSP software for the PowerPC PPC405GP processor working in Real mode, with only the first 128 MB region (0x00000000 – 0x07FFFFFF) configured as cache enabled (ICCR=0x80000000; DCCR=0x80000000; DCWR=0x0; SGR=0x0; SU0R=0x0; SLER=0x0). To create both cacheable and non-cacheable memory regions within the 128 MB of SDRAM, it was necessary to use Translation mode, where TLB entries with smaller page sizes can be created and different memory attributes can be set for each entry.

Enabling Translation mode involves 1) setting up the TLB entries and 2) setting the MSR[IR] and MSR[DR] bits.

How to set the TLB entries?

The following assembly routine takes the index, the TLBHI (tag) portion and the TLBLO (data) portion as input parameters and writes the TLB entry.

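    set_tlb_entry:           ; set_tlb_entry(tlb_index, TLBHI, TLBLO)
                             ; r3 = tlb_index, r4 = TLBHI, r5 = TLBLO
        sync
        tlbwe r4,r3,0        ; Write TLBHI (tag) of entry r3
        tlbwe r5,r3,1        ; Write TLBLO (data) of entry r3
        isync
        blr                  ; Return to the caller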

Using the above routine, I write the following entries.

/* TLB Index=0 */
/* EPN=0x00000000;Size=4MB;Valid=1;Endian=0;U0=0 */
/* RPN=0x00000000;EX=1;WR=1;ZSEL=0;W=0;I=0;M=0;G=0 */
/* This entry maps 0x0 physical address to same 0x0 virtual address;
    Enables the cache and grants Execute and Write permissions.
    (SDRAM for execution) */
set_tlb_entry(0, 0x00000340, 0x00000300);

/* TLB Index = 1 */
/* EPN=0x00400000;Size=4MB;Valid=1;Endian=0;U0=0 */
/* RPN=0x00400000;EX=0;WR=1;ZSEL=0;W=0;I=1;M=0;G=0 */
/* This entry maps 0x400000 physical address to same virtual address;
    Disables the cache and grants only Write permission.
    (SDRAM for IO Buffers) */
set_tlb_entry(1, 0x00400340, 0x00400104);
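The two magic words passed to set_tlb_entry can also be assembled from their fields. The following is only a sketch: the macro and function names are made up for illustration, and the bit positions are my reading of the PPC405 TLB entry layout (bit 0 is the most significant bit), chosen so that they reproduce the values used above. Please check them against the user manual of your core before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PPC405 TLB field positions (bit 0 = MSB). Names are illustrative. */
#define TLB_PAGE_MASK    0xFFFFFC00u  /* EPN / RPN, bits 0:21            */
#define TLBHI_SIZE_SHIFT 7            /* SIZE, bits 22:24                */
#define TLBHI_V          0x00000040u  /* Valid, bit 25                   */
#define TLBLO_EX         0x00000200u  /* Execute permission, bit 22      */
#define TLBLO_WR         0x00000100u  /* Write permission, bit 23        */
#define TLBLO_W          0x00000008u  /* Write-through, bit 28           */
#define TLBLO_I          0x00000004u  /* Cache inhibited, bit 29         */
#define TLBLO_M          0x00000002u  /* Memory coherent, bit 30         */
#define TLBLO_G          0x00000001u  /* Guarded, bit 31                 */

/* Assumed encoding of the SIZE field. */
enum tlb_page_size {
    TLB_SIZE_1K = 0, TLB_SIZE_4K = 1, TLB_SIZE_16K = 2, TLB_SIZE_64K = 3,
    TLB_SIZE_256K = 4, TLB_SIZE_1M = 5, TLB_SIZE_4M = 6, TLB_SIZE_16M = 7
};

/* Build a valid TLBHI (tag) word with Endian=0 and U0=0. */
static uint32_t make_tlbhi(uint32_t epn, enum tlb_page_size size)
{
    assert((epn & ~TLB_PAGE_MASK) == 0);  /* low 10 bits must be clear */
    return epn | ((uint32_t)size << TLBHI_SIZE_SHIFT) | TLBHI_V;
}

/* Build a TLBLO (data) word with ZSEL=0 from an RPN and attribute flags. */
static uint32_t make_tlblo(uint32_t rpn, uint32_t attrs)
{
    assert((rpn & ~TLB_PAGE_MASK) == 0);   /* low 10 bits must be clear    */
    assert((attrs & TLB_PAGE_MASK) == 0);  /* flags must not overlap RPN   */
    return rpn | attrs;
}
```

With these helpers, the second entry above could be written as set_tlb_entry(1, make_tlbhi(0x00400000, TLB_SIZE_4M), make_tlblo(0x00400000, TLBLO_WR | TLBLO_I)), which makes the chosen attributes visible at the call site.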

(Now I have the vector table and the executable code starting at address 0x0.) It is time to enable Translation mode by setting the bits MSR[IR] and MSR[DR]. But before doing this, I must invalidate the instruction cache and flush and invalidate the data cache, because the caches were used under the Real mode memory configuration and are about to be used under a different configuration defined by the TLB entries.

OK, the caches are flushed and invalidated. What next? I thought the Real mode settings could be reset to ICCR=0 and DCCR=0, since they become useless once Translation mode is enabled. So I cleared all the Real mode settings and enabled Translation mode. But when I ran my code, an exception was thrown at one particular location. Even though I knew the location, I was clueless about why the exception came from there: execution passes through that location only once, during initialization, and initialization had gone fine. So how was that location executed again afterwards?

I found one strange thing: when I left the Real mode configuration as it was (cache enabled for the first 128 MB), execution went fine. While wondering why the Real mode settings were still needed with Translation mode enabled, I found the problem. When an interrupt is taken, for example the external interrupt at 0x500 or the system call at 0xc00, the MSR[IR] and MSR[DR] bits are cleared. With these bits cleared, the processor automatically switches to Real mode, and until the MSR bits are restored all accesses go through Real mode. So the same memory area can be accessed in Real mode and in Translation mode with different cache attributes, and execution goes wrong.

My guess was right! During the system call at 0xc00, the context was stored at memory location 0x5000 while MSR[IR, DR] = 0, and it was stored/restored again while MSR[IR, DR] = 1. As this continues, the data becomes inconsistent and causes undefined behaviour later on.

So I set the same memory attributes for the memory page that contains address 0x5000 (the first TLB entry) and for the first Real mode region (0x00000000 – 0x07FFFFFF), so that the memory region is accessed with the same attributes in both Real mode and Translation mode.
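The rule "same cache attribute in both modes" can be checked on paper before touching the hardware. The sketch below is a host-side illustration, not BSP code: the function names are hypothetical, and it assumes that each DCCR bit covers one 128 MB region with the most significant bit covering 0x00000000 – 0x07FFFFFF (which matches DCCR=0x80000000 above) and that the I bit of TLBLO is 0x00000004.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLBLO_I 0x00000004u  /* Assumed position of the cache-inhibit bit */

/* Index (0..31) of the 128 MB Real mode region that contains addr. */
static unsigned real_mode_region(uint32_t addr)
{
    return addr >> 27;  /* 2^27 bytes = 128 MB */
}

/* DCCR/ICCR bit for a region; region 0 is the most significant bit. */
static uint32_t region_bit(unsigned region)
{
    assert(region < 32u);
    return 0x80000000u >> region;
}

/* True if Real mode data accesses to addr are cached under this DCCR. */
static bool real_mode_cacheable(uint32_t dccr, uint32_t addr)
{
    return (dccr & region_bit(real_mode_region(addr))) != 0;
}

/* True if a page mapped with this TLBLO word is cached. */
static bool tlb_cacheable(uint32_t tlblo)
{
    return (tlblo & TLBLO_I) == 0;
}

/* True if addr sees the same cacheability in Real and Translation mode. */
static bool cache_attrs_match(uint32_t dccr, uint32_t tlblo, uint32_t addr)
{
    return real_mode_cacheable(dccr, addr) == tlb_cacheable(tlblo);
}
```

For the context page at 0x5000, cache_attrs_match(0x80000000, 0x00000300, 0x5000) holds, while the failing configuration with DCCR=0 does not pass the check.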

So my code sequence is as follows:

  1. At reset, MSR[IR, DR] is 0, and Real mode works with all caching inhibited.
  2. In the boot code, caching is enabled for the first 128 MB region.
  3. To switch to Translation mode, reset Real mode to all caching inhibited.
  4. Invalidate the instruction cache.
  5. Flush and invalidate the data cache.
  6. Write the TLB entries.
  7. Set the Real mode registers (ICCR, DCCR, DCWR) to the same attributes as the TLB entry that covers the context page.
  8. Set MSR[IR] and MSR[DR] to enable Translation mode.

This is also applicable to IBM PowerPC 405Fx (PPC405Fx core) and Xilinx PowerPC 405 (PPC405 core) processors.

(For any queries, please write in the feedback column)

Sunday, February 07, 2010

Turbo C compiler Installation on Windows

An automatic Turbo C installer is available at the following link:

http://webcourse.cs.technion.ac.il/234112/Spring2006/ho/WCFiles/TURBOC30.exe

This will install the compiler in the c:/tc folder. You can open your favourite compiler environment by double-clicking c:/tc/bin/TC.exe.

If you are using Windows Vista, you have to install the DOSBox DOS emulator to work in full-screen mode. I ran into a problem when I double-clicked c:/tc/bin/TC.exe; using DOSBox solved it.

(1) Install DOSBox version 0.72 (1.2 MB, freeware) from the direct link below:

http://prdownloads.sourceforge.net/dosbox/DOSBox0.72-win32-installer.exe?download

(2) Before going into the details, create a folder (any name will do). Here we name it Turbo.

(3) Copy the TC folder into the Turbo folder.

(4) Run DOSBox 0.72 from the icon on the desktop or from the installation folder.

(5) You are then presented with two windows that look like the command prompt in Windows. Use the one with the Z prompt; you can ignore the other window.

(6) The mount command, typed at the [Z] prompt, has the following general form:

mount [any drive letter you wish except z] [the path of the folder that contains Turbo C], then press Enter

(7) For the folder created above, type the following command at the Z prompt:

Z: mount d c:\Turbo\ [The folder TC is inside the folder Turbo]

(8) You should now get a message that says: Drive D is mounted as a local directory c:\Turbo\

(9) Type d: to switch to the d: prompt. Then enter the commands below:

cd TC [The contents of the folder Turbo are mounted as a virtual drive (here, the D drive)]

cd Bin

TC or Tc.exe [This brings up the Turbo C++ 3.0 screen]

(10) In Turbo C++, go to Options > Directories and change the TC paths to the virtual drive [D] (i.e. virtual D: refers to the original c:\Turbo\, so change the paths to something like D:\TC\include and D:\TC\lib respectively).
===========================================================

Points to Note:

(1) To get full screen, use the key combination Alt + Enter.

(2) When you exit DOSBox [precisely, when you unmount the virtual drive where Turbo C++ 3.0 has been mounted], all the files you saved or changed in Turbo C++ 3.0 will be found in the source directory (the directory which contains the TC folder).

(3) It is a good idea to back up your files in the source directory before running DOSBox 0.72.

(4) For additional help, go through the readme file located in the installation folder or look on the website of the DOSBox forum.

(5) Don't use shortcut keys to perform operations in TC, because they might also be shortcut keys for DOSBox. E.g. Ctrl+F9 will exit DOSBox rather than run the code.

UPDATE :

You can save yourself some time by having DOSBox automatically MOUNT your folders.

For DOSBox versions older than 0.73, browse to the program installation folder and open the dosbox.conf file in any text editor. For version 0.73, go to the Start Menu and click on "Configuration" and then "Edit Configuration". Then scroll down to the very end (the [autoexec] section) and add the lines that you want executed automatically when DOSBox starts.

Now those commands will be executed automatically every time DOSBox starts!

                                                                                                          - Reference Tech Guru

Hard and Soft Real time systems

In general, real-time systems have to finish a particular job within a specified time. In other words, real-time systems have time restrictions and they have to meet deadlines. According to the following factors, they can be classified as hard and soft real-time systems:
  • the permissible (tolerance) level of not meeting the deadline
  • the usefulness of the result of the job obtained after the deadline has expired
  • the severity of the penalty paid when the deadline is not met

In hard real-time systems, the permissible level of not meeting the deadline is almost zero. In other words, the deadline must be met. A result obtained after the deadline is useless. The penalty paid when the deadline is not met is destruction or failure of the system.

In soft real-time systems, the permissible level of not meeting the deadline is not zero. The usefulness of a result obtained after the deadline is not zero either; it depreciates gradually over time. Even when the deadline is not met, the effect is not physically destructive.

So, hard real-time systems have almost zero flexibility and they have to meet the deadline at any cost: either the deadline is met or the system fails. And the failure of the system carries an extremely high penalty, even loss of human life. A result obtained after the deadline is almost useless. For example, a car engine control system is a hard real-time system because a delayed signal may cause engine failure or damage. Other examples of hard real-time embedded systems include medical systems such as heart pacemakers and industrial process controllers.

Though soft real-time systems also have to meet the deadline, they have flexibility. They can change the flexibility level or set an average value. Though there is no damage when the deadline is not met, depending on the application a miss will have its own cost, proportional to the delay. Live audio-video systems are also usually soft real-time; violation of constraints results in degraded quality, but the system can continue to operate.
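The difference in "usefulness of the result after the deadline" can be pictured as a small value function. This is only an illustrative sketch: the function name, the grace parameter and the linear decay for the soft case are my own assumptions, since real systems define their own depreciation curve.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Usefulness of a result, from 1.0 (fully useful) down to 0.0 (useless).
 * finish and deadline are points in time in the same unit.
 * grace is the (assumed) time after the deadline at which a soft
 * real-time result has lost all of its value.
 */
static double result_value(double finish, double deadline,
                           bool hard, double grace)
{
    double late;

    assert(grace > 0.0);

    if (finish <= deadline)
        return 1.0;              /* deadline met: full value           */
    if (hard)
        return 0.0;              /* hard: a late result is useless     */

    late = finish - deadline;
    if (late >= grace)
        return 0.0;              /* soft: value has fully depreciated  */
    return 1.0 - late / grace;   /* soft: assumed linear depreciation  */
}
```

With a deadline of 10 and a grace period of 5, a result delivered at time 12.5 is worth nothing in the hard case but still half of its value in the soft case.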

Friday, February 05, 2010

Processor architectures: Harvard, von Neumann and Modified Harvard architectures

A Harvard architecture has separate data and instruction busses, allowing transfers to be performed simultaneously on both busses. A von Neumann architecture has only one bus which is used for both data transfers and instruction fetches, and therefore data transfers and instruction fetches must be scheduled: they cannot be performed at the same time.


It is possible to have two separate memory systems for a Harvard architecture. As long as data and instructions can be fed in at the same time, then it doesn't matter whether it comes from a cache or memory. But there are problems with this. Compilers generally embed data (literal pools) within the code, and it is often also necessary to be able to write to the instruction memory space, for example in the case of self modifying code, or, if an ARM debugger is used, to set software breakpoints in memory. If there are two completely separate, isolated memory systems, this is not possible. There must be some kind of bridge between the memory systems to allow this.

Using a simple, unified memory system together with a Harvard architecture is highly inefficient. Unless it is possible to feed data into both busses at the same time, it might be better to use a von Neumann architecture processor.

Use of caches

At higher clock speeds, caches are useful as the memory speed is proportionally slower. Harvard architectures tend to be targeted at higher performance systems, and so caches are nearly always used in such systems.

Von Neumann architectures usually have a single unified cache, which stores both instructions and data. The proportion of each in the cache is variable, which may be a good thing. It would in principle be possible to have separate instruction and data caches, storing data and instructions separately. This probably would not be very useful as it would only be possible to ever access one cache at a time.

Caches for Harvard architectures are very useful. Such a system would have separate caches for each bus. Trying to use a shared cache on a Harvard architecture would be very inefficient since then only one bus can be fed at a time. Having two caches means it is possible to feed both busses simultaneously, which is exactly what is necessary for a Harvard architecture.

This also makes it possible to have a very simple unified memory system, using the same address space for both instructions and data. This gets around the problem of literal pools and self-modifying code. What it does mean, however, is that when starting with empty caches, it is necessary to fetch instructions and data from the single memory system at the same time. Obviously, two memory accesses are therefore needed before the core has all the data it needs. This performance will be no better than a von Neumann architecture. However, as the caches fill up, it is much more likely that the instruction or data value has already been cached, and so only one of the two has to be fetched from memory. The other can be supplied directly from the cache with no additional delay. The best performance is achieved when both instructions and data are supplied by the caches, with no need to access external memory at all.
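The three cases described above (both caches miss, one misses, both hit) can be put into a toy cost model. This is a deliberately simplified sketch of my own, not taken from the ARM text: it assumes that cache hits on both busses are served in parallel in one cycle, and that misses are serialized on the single memory port, each costing mem_cycles.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Cycles needed to obtain one instruction and one data word in the same
 * step on a core with split caches in front of a unified memory.
 * Assumptions: a hit costs 1 cycle and both caches work in parallel;
 * every miss goes to the single memory port and costs mem_cycles.
 */
static unsigned fetch_cycles(bool icache_hit, bool dcache_hit,
                             unsigned mem_cycles)
{
    unsigned misses;

    assert(mem_cycles >= 1u);

    misses = (icache_hit ? 0u : 1u) + (dcache_hit ? 0u : 1u);
    if (misses == 0u)
        return 1u;                /* both served by the caches       */
    return misses * mem_cycles;   /* misses share one memory port    */
}
```

With a memory latency of 10 cycles this gives 20 cycles for empty caches (no better than a von Neumann machine), 10 cycles when one side hits, and 1 cycle when both sides hit.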

A unified memory system behind separate instruction and data caches is the most sensible compromise and the architecture used by ARM's Harvard processor cores. Two separate memory systems can perform better, but would be difficult to implement.

                                                                                                                           -ARM Information Center

Modified Harvard architecture


A Modified Harvard architecture machine is very much like a Harvard architecture machine, but it relaxes the strict separation between instruction and data while still letting the CPU concurrently access two (or more) memory busses.

The most common modification includes separate instruction and data caches backed by a common address space. While the CPU executes from cache, it acts as a pure Harvard machine. When accessing backing memory, it acts like a von Neumann machine (where code can be moved around like data, a powerful technique). This modification is widespread in modern processors such as the ARM architecture and x86 processors. It is sometimes loosely called a Harvard architecture, overlooking the fact that it is actually "modified".

Another modification provides a pathway between the instruction memory (such as ROM or flash) and the CPU to allow words from the instruction memory to be treated as read-only data. This technique is used in some microcontrollers, including the Atmel AVR. This allows constant data, such as text strings or function tables, to be accessed without first having to be copied into data memory, preserving scarce (and power-hungry) data memory for read/write variables. Special machine language instructions are provided to read data from the instruction memory. (This is distinct from instructions which themselves embed constant data, although for individual constants the two mechanisms can substitute for each other.)

Modern uses of the Harvard architecture


The principal advantage of the pure Harvard architecture - simultaneous access to more than one memory system - has been reduced by modified Harvard processors using modern CPU cache systems. Relatively pure Harvard architecture machines are used mostly in applications where tradeoffs, such as the cost and power savings from omitting caches, outweigh the programming penalties from having distinct code and data address spaces.

Digital signal processors (DSPs) generally execute small, highly-optimized audio or video processing algorithms. They avoid caches because their behavior must be extremely reproducible. The difficulties of coping with multiple address spaces are of secondary concern to speed of execution. As a result, some DSPs have multiple data memories in distinct address spaces to facilitate SIMD and VLIW processing. Texas Instruments TMS320 C55x processors, as one example, have multiple parallel data busses (two write, three read) and one instruction bus.

Microcontrollers are characterized by having small amounts of program (flash memory) and data (SRAM) memory, with no cache, and take advantage of the Harvard architecture to speed processing by concurrent instruction and data access. The separate storage means the program and data memories can have different bit depths, for example using 16-bit wide instructions and 8-bit wide data. It also means that instruction prefetch can be performed in parallel with other activities. Examples include the 8051, the AVR by Atmel Corp, and the PIC by Microchip Technology, Inc.

Even in these cases, it is common to have special instructions to access program memory as data for read-only tables, or for reprogramming.

The von Neumann architecture is a design model for a stored-program digital computer. By contrast, the earliest computing machines had fixed programs. For example, a desk calculator (in principle) is a fixed-program computer. It can do basic mathematics, but it cannot be used as a word processor or a gaming console. Changing the program of a fixed-program machine requires re-wiring, re-structuring, or re-designing the machine.
                                                                                                                                       -Wikipedia