Saturday, May 17, 2014

SH-4A Memory map

It is very confusing to read the English version of the Renesas SuperH architecture user's manual about the MMU and memory map of SH-4A. It confuses address translation with virtual addresses. It calls 0x80000000 ~ 0xbfffffff virtual addresses, though no address translation happens for this address space even when the MMU is enabled. I consider these not virtual addresses but just shadow memory spaces, or memory windows, which reflect other physical address space. In legacy devices, the window simply reflected the physical address obtained by dropping the 3 MSBs; in other words, 0x80000000 ~ 0x9fffffff reflects the physical space 0x00000000 ~ 0x1fffffff. In the same way, 0xa0000000 ~ 0xbfffffff also reflected the same physical space 0x00000000 ~ 0x1fffffff, with the exception that this window is non-cacheable while the previous one is cacheable. In recent devices such as SH-4A, the physical address space is extended to 32 bits, and the windows can be configured to reflect different physical address spaces by programming the PMB (Privileged Space Mapping Buffer). Note that the cache attributes of these memory windows also become configurable through the PMB.

SuperH architecture does not provide a flat, freely usable physical or virtual address space. Each has areas reserved for architecture-specific definitions. For example, 0x1c000000 ~ 0x1fffffff (called area 7) is reserved in the physical address space for the architecture-specific control registers, and 0xfc000000 ~ 0xffffffff is reserved to shadow the same registers. Some other areas are reserved similarly; we will see them in the sections below. The virtual address space is not flat either, and only some fixed address ranges are available: 0x00000000 ~ 0x7fffffff and 0xc0000000 ~ 0xdfffffff are the only two address ranges available for virtual memory. The other areas are not available for MMU translation; they just reflect some fixed memory contents from the physical address space.

Legacy Physical address space:

Legacy SuperH architecture has only a 29-bit address bus and therefore allows at most 512 MB of physical address space. This address space is mostly shadowed in other regions such as P1 (0x80000000 ~ 0x9fffffff), P2 (0xa0000000 ~ 0xbfffffff) and P4 (0xfc000000 ~ 0xffffffff).
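As a rough sketch (the function names are mine, not from the manual), the legacy window-to-physical mapping described above is just a matter of clearing the top 3 address bits:

```c
#include <stdint.h>

/* Sketch: on legacy SH, P1 and P2 are fixed windows onto the same
   512 MB physical space, so translating a window address to a physical
   address is just clearing the 3 MSBs. */
static uint32_t legacy_phys(uint32_t addr)
{
    return addr & 0x1fffffff;   /* drop the 3 MSBs */
}

/* P1 (0x80000000 ~ 0x9fffffff, top 3 bits = 100) is the cacheable window;
   P2 (0xa0000000 ~ 0xbfffffff, top 3 bits = 101) is the non-cacheable one. */
static int is_p1_cacheable(uint32_t addr)
{
    return (addr >> 29) == 0x4;
}
```

So 0x8c000000 (in P1) and 0xac000000 (in P2) both reach physical 0x0c000000; only the cache behavior differs.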


Recent Physical space:
In recent devices such as SH-4A, the address bus is extended to 32 bits.

A more detailed view: you can connect devices anywhere in the 32-bit physical space except the SH reserved space. But that does not mean you can directly access your devices at the physical addresses where they are connected. The two areas 0x80000000 ~ 0x9fffffff and 0xa0000000 ~ 0xbfffffff need to be mapped explicitly through the PMB. See the detailed picture below:
Short descriptions of each area:

P0, U0:

It spans around 2 GB (0x00000000 ~ 0x7fffffff) of address space. When the MMU is enabled, it becomes the largest virtual address space. The underlying physical addresses of this area are shadowed in different areas. So, this virtual address space can be mapped one-to-one with the physical address space, or in any other way.

P1, P2:

These are memory windows for privileged mode to access different physical address spaces.
These addresses are not translated through the TLB. In recent devices, different physical address spaces can be mapped through the PMB; in legacy devices, they just reflect the physical address obtained by clearing the 3 MSBs.

P3:

This is the only address space where virtual mapping can be done and, at the same time, the connected devices can be accessed directly at the same physical addresses where they are mapped.

P4:

This area has a different view in user mode and privileged mode. In user mode, it gives access to the on-chip RAM (cache) and the store queues. In privileged mode, it also gives access to the TLBs, the PMB, control registers, etc. This area is completely reserved for the architecture.

Overall interesting picture from ARM:

Write buffers vs Store queues

Write Buffers:

Do you know NCM? It accumulates data and sends it in batches instead of sending each packet separately, to improve overall throughput. A write buffer is similar. But that does not mean the transfer is postponed forever until enough data accumulates; it happens at the next earliest possible timing (unlike caches, which postpone the write until eviction). Instead of writing to the I/O device, the CPU writes to the write buffer, and the write buffer does it soon after. It is like a server thread passing requests to a set of worker threads. For example, the CPU (or cache) boss needs to post a letter. He hands the request to James: "Hey James! Just post this letter. I need to make a telephone call." James goes out with the letter. He may be on the way to the post office, but the boss is not sure when James will complete the posting. Just to make sure, he may say, "Hey James! Post this letter and bring me a glass of water!" So the next task can be a dummy read or some valid transaction. But to make sure that James does all the tasks in order, the memory type needs to be set as "Device".

https://www.google.co.jp/#q=Write+buffer+dummy+read

"Strongly ordered" memory type is CPU boss himself is doing all the stuff in orderly fashion.

Store Queues:

Store queues can be considered a one-way (write-only) cache which can do burst write accesses to memory.

An LCD display system can be used as an example: the CPU draws the picture, and the LCD controller displays it. When the underlying content changes completely, the entire display data must be overwritten. When the cache is used in write-back mode, the previous data is first copied into the cache entry because of the cache-miss access. The data read into the cache is then overwritten without being used, so that read operation is wasted. If the store queue is used instead, there are no read accesses such as cache fills, only write accesses. 32 bytes of data are written at a time when the data are written back to memory from the SQ. With synchronous DRAM, for example, this becomes a burst access, making high-speed write-back possible. The prefetch (PREF) instruction is used to trigger the write back to memory.
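A rough host-side model of that flow (the function name and the plain buffer are mine; on a real SH-4 the stores would target the SQ area at the top of P4 and each 32-byte block would be flushed with PREF):

```c
#include <stddef.h>
#include <stdint.h>

#define SQ_BLOCK 32u   /* one store queue holds 32 bytes (8 longwords) */

/* Fill a frame buffer with a solid color, 32 bytes at a time.
   Unlike a write-back cached store, nothing is read first: each block is
   write-only, and on SH-4 it would leave as a single burst write when
   PREF is issued on the SQ address. */
static void sq_style_fill(uint32_t *dst, uint32_t color, size_t bytes)
{
    for (size_t b = 0; b < bytes; b += SQ_BLOCK) {
        for (size_t i = 0; i < SQ_BLOCK / sizeof(uint32_t); i++)
            dst[b / sizeof(uint32_t) + i] = color;
        /* on real SH-4: pref @Rn here to kick the 32-byte burst out */
    }
}
```

The inner loop is the 8 longword stores into one queue; the comment marks where the burst to SDRAM would be triggered.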

Saturday, May 10, 2014

Achieving Fast Boot in Statically built Applications

Strategy:

In domain-specific applications such as automotive, where system start-up needs to be achieved in a few hundred milliseconds, the application and boot-up sequence need to be optimized in the following way.

Even though the binary is one single image, the inner modules can be divided as follows, and the corresponding code, data and BSS sections of each functional module can be scattered into separate sections, each with distinguished start and end symbols.

The boot can be divided into several stages, and many modules can be copied into RAM and executed on demand, instead of initializing everything at boot time.

Stage 1: A portion of the executable binary image is loaded and run. It is responsible for the reset vector, necessary hardware initialization such as the memory map and SDRAM, basic OS initialization, and starting up the control task.

Stage 2: The control task monitors for events, then loads and executes another portion of the binary image which is responsible for a specific functionality.

This is very similar to building all optional functionalities as Linux modules and loading them on demand. But this is about how to do it in a single, static binary image.

Enhancement: Inside each functional module, the initialization functions and termination functions can be defined separately and moved into the boot-time initialization module. How to do this is described below. This is very similar to the module_init() and module_exit() calls.
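A minimal sketch of such a scheme, assuming a GCC/ELF toolchain (the section name, macro and module are all invented for illustration): each module drops a function pointer into a dedicated section, and the boot stage walks that section between the linker-provided __start_/__stop_ symbols, much like module_init().

```c
typedef void (*init_fn)(void);

/* Invented convention: put each stage-2 init function's pointer into
   the "stage2_init" section. On GNU ELF toolchains the linker provides
   __start_<section> / __stop_<section> symbols automatically for sections
   whose names are valid C identifiers. */
#define STAGE2_INIT(fn) \
    static init_fn fn##_entry \
        __attribute__((used, section("stage2_init"))) = fn

extern init_fn __start_stage2_init[], __stop_stage2_init[];

/* A hypothetical functional module registering itself. */
static int audio_ready;
static void audio_init(void) { audio_ready = 1; }
STAGE2_INIT(audio_init);

/* Called by the control task when the corresponding event arrives. */
static void run_stage2_inits(void)
{
    for (init_fn *f = __start_stage2_init; f < __stop_stage2_init; f++)
        (*f)();
}
```

With a custom linker script, the same idea extends to the code/data/BSS sections of each module, copied into RAM just before its stage runs.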

(Be foolish when asked what to do. Be wise when asked how to do it.
Products made by wise people, but made for fools, will get great success.)



Monday, May 05, 2014

How do modern hardware designs exceed software needs?

Initially, Intel hardware supported only a 32-bit bus, which allowed up to 4 GB of RAM. Later, 4 GB was not enough for servers, so 4 more bits were added to the hardware bus (PAE). Since then, RAM can be extended up to 64 GB. In Linux, the page size is 4 KB, so the page offset takes 12 bits and the remaining 20 bits of the virtual address are divided 10-10 between the two table levels. Each entry is one 32-bit word holding the 20-bit physical frame number plus flag bits, so each 1024-entry page table is 4 KB. For Intel, Linux used only 2 real levels and one dummy level. To support the extended bits, the entries are widened to 8 bytes and a 2-9-9 table structure is used. But the logical address is still 32 bits, while the physical address is a 36-bit address. How is this managed? In the Linux page table entry, the virtual side remains 32 bits, but the physical frame field is extended by 4 bits. So, though 64 GB of physical RAM is available, no single process can utilize the whole RAM because its (logical) address space is limited to 4 GB. But different processes can map different parts of the 64 GB RAM into their logical address spaces and co-exist in RAM without any additional overhead of swapping.
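The splits above can be checked with a small sketch (the helper names are mine): non-PAE divides a 32-bit virtual address 10-10-12, while PAE divides it 2-9-9-12 so that each widened 8-byte entry can carry a frame number reaching up to bit 35 of the 36-bit physical address.

```c
#include <stdint.h>

/* Non-PAE: 10-bit directory index, 10-bit table index, 12-bit offset. */
static unsigned pde_index(uint32_t va) { return va >> 22; }
static unsigned pte_index(uint32_t va) { return (va >> 12) & 0x3ff; }

/* PAE: 2-bit PDPT index, 9-bit directory index, 9-bit table index. */
static unsigned pae_pdpt(uint32_t va) { return va >> 30; }
static unsigned pae_pd(uint32_t va)   { return (va >> 21) & 0x1ff; }
static unsigned pae_pt(uint32_t va)   { return (va >> 12) & 0x1ff; }

static unsigned page_offset(uint32_t va) { return va & 0xfff; }

/* A PAE entry is 64 bits; masking out the low flag bits leaves the
   frame address (bits 12..35), to which the page offset is added: */
static uint64_t pae_phys(uint64_t pte, uint32_t va)
{
    return (pte & 0xffffff000ull) | page_offset(va);
}
```

For example, a frame above the 4 GB line in a page table entry still resolves to a full 36-bit physical address, even though the virtual address stays 32 bits.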