Wednesday, November 17, 2010

ST-Open's Wrappers for 64 Bit Windoze

Wrappers generally are used to reduce calls to external functions to one place. In former times, this was done because all external calls were far calls to other code segments. Things changed a little with the flat memory model, but API functions still reside in higher privileged code segments or are call gates to OS functions running in ring 0, 1 or 2.

Another important issue is the encapsulation of dirty functions, known to overwrite registers without restoring them. As shown in my Intelligent Design paper, a dirty environment slows down execution markedly, because frequently used parameters must be reloaded after each call to a dirty function.

Unfortunately, the dirtiness of Windoze and Linux grew with the switch to their 64 bit ABI/API. While only two registers were used as garbage pile in 32 bit programming environments (ECX, EDX), we now have to face the abuse of 8 registers in Windoze environments (RCX, RDX, R08, R09, R10, R11, XMM4 and XMM5) or even 11 registers in Linux (RDI, RSI, RCX, RDX, R08, R09, R10 and XMM4...XMM7). Whenever you pass the mentioned registers to an API function, you can be sure they are returned with changed content. To avoid changed registers, you have to use wrappers. If you do not, you have to reload your parameters over and over again.

Wrappers are a must if you want to keep your programming environment clean. As a positive side effect, you can customize API calls. Doing so can reduce the number of arguments to pass drastically - a wrapper can manage things like retrieving the HWND for a dialog item much faster than the programmer, because it already has all registers preserved, saving the time to preserve and restore them more than once.

Windoze 64 Bit Wrapper

My programming environment is the TDM/MinGW64 package, dated 2010-05-09. Later packages are broken and refuse to work properly. All code snippets are AS sources taken from ST-Open's system library. All functions in this environment must be defined as follows:
.text
.globl   _MyFunction
.def     _MyFunction; .scl 2; .type 32; .endef
 _MyFunction:
To tell GCC to put the following output into the code segment, we have to insert a .text statement on top of our code. If your code references static data in the data or DR segment, their definitions should precede the .text statement. Never put data into the code segment - it can cause a drastic performance loss.

Next, the .globl statement tells the linker this function is globally available (visible), so external functions can call it. If the .globl is missing, that function can only be called from functions residing in the current source file.

A .def statement provides additional symbol and debugging information for the object file. The .scl 2 defines the storage class (external), while .type 32 marks the symbol as a function. This statement probably is superfluous in pure assembler programs, but nevertheless is required if the function shall be callable from external HLL functions.

Finally, the real code of MyFunction() begins at _MyFunction. That is: The address of the first instruction following the label _MyFunction is the start address of _MyFunction. Whenever the linker stumbles upon a call _MyFunction, it replaces the label with a reference to that start address.

So much for the HLL compatible gobbledygook. Let's continue with some basic thoughts about the internal organisation of a wrapper.

Wrapper Designs

There are two different ways to organise a wrapper - either you provide separate functions with endless repetitions of one and the same prologue and epilogue (stand-alone), or you use one prologue and epilogue for all functions (collection).

Stand-Alone Wrappers

Providing a separate prologue and epilogue for each function has one advantage: The linker can cut the function's code out of a library and add just that code to the program where the function is called. The downside is a library with tons of redundant repetitions of one and the same prologue and epilogue. Hence, programs using the library are kept smaller, while the size of the library is quite large. Depending on the size of the prologue and epilogue, there is a point where the advantage is eaten up by their repetition. For example, the payload of an API wrapper is about ten percent of the entire wrapper, while the remaining 90 percent are occupied by preserving and restoring clobbered registers.

Let us assume our library has 20 functions. Each of them consists of about ten percent payload and 90 percent duplicated prologue and epilogue, so the library stores twenty copies of identical code. Worse, every extracted function drags its own 90 percent overhead into the program - whether you call one library function or ten, the ratio stays at ten percent payload and 90 percent overhead. As you can see, this concept looks better at first glance, but turns out to be a bad design for daily use.

Collected Wrappers

Here we go the other way around. This concept adds no advantage if only one function is used, but pays off as soon as we call multiple functions. The overhead now has a fixed size - one shared prologue and epilogue - while only the small payload grows per function. The more functions we use, the smaller the share of the overhead becomes: with the 10/90 split from above, one function yields ten percent payload, two functions about 18 percent, and eight functions already about 47 percent. In the end, we can reduce the size of our program by quite a few bytes if we use a collected wrapper. Therefore, most ST-Open libraries meanwhile use collected rather than stand-alone functions.

Windoze API

Prologue And Epilogue

With the change to 64 bit, microsoft decided to follow in the footsteps of Loonix and abuse more registers as garbage pile. In addition to rCX and rDX in 32 bit code, we now have to preserve R08, R09, R10, R11, XMM4 and XMM5 as well if we do not want to reload the contents of these registers after each API call. I have seen a lot of register dumps throughout the last weeks, so I can tell you those registers definitely are destroyed after each API call. The most important parts of any API wrapper therefore are its prologue and epilogue. The prologue looks like this:
    ...

    .p2align 4,,15
  0:subq     $0xB8,%rsp
    nop
    nop
    movdqa   %xmm4,0x60(%rsp)
    movdqa   %xmm5,0x70(%rsp)
    movq     %rcx, 0x88(%rsp)
    movq     %rdx, 0x90(%rsp)
    movq     %r8,  0x98(%rsp)
    movq     %r9,  0xA0(%rsp)
    movq     %r10, 0xA8(%rsp)
    movq     %r11, 0xB0(%rsp)
    jmp      *%rax

    ...
where 0 is the entry point for the function declarations placed above the prologue. RAX is set to the address of the real function code at the end of each declaration, preceding the jump to 0. Even though it looks quite lengthy, saving all registers is done in about 5 clock cycles (RSP correction plus two write combining sequences). Using six pushes for the GPRs and two movdqas for the XMM registers would take about 15 clock cycles, because push works with decreasing addresses, so no write combining is triggered (two clocks for the movdqas, 13 clocks for the six pushes - RSP is available after 2 clocks, only the last push needs all three clock cycles).

The epilogue is quite similar to the prologue, except target and source of the move instructions are exchanged:
    ...

    .p2align 4,,15
XIT:movdqa   0x60(%rsp),%xmm4
    movdqa   0x70(%rsp),%xmm5
    movq     0x88(%rsp),%rcx
    movq     0x90(%rsp),%rdx
    movq     0x98(%rsp),%r8
    movq     0xA0(%rsp),%r9
    movq     0xA8(%rsp),%r10
    movq     0xB0(%rsp),%r11
    addq     $0xB8,%rsp
    ret
This is trivial code and holds no mysteriously hidden secrets. In contrast to the prologue, register reads have no accelerating mechanisms like write combining, so it takes about 10 clock cycles until the final return is executed - memory reads and writes are limited to one access per clock cycle. With pops, the epilogue would be executed in 15 clock cycles (two for the movdqas, 13 for the pops, where only the last one needs 3 cycles, the others are ready after 2 clocks).

All mentioned latencies are valid for PhenomII (family 10) only. For older Athlons (family 8), the push version needs 21 and the pop version 27 clock cycles. Latencies for the Intelligent Design versions are the same for both processor families.

Function Declarations

All function declarations use a stereotyped pattern, where only the function names change from declaration to declaration. The following snippet is just an excerpt from my original file:
          .text

          .p2align 4,,15
          .globl   _RegClass
          .def     _RegClass; .scl 2; .type 32; .endef
_RegClass:movq     $rclass,%rax
          jmp      0f

          .p2align 4,,15
          .globl   _RgClassX
          .def     _RgClassX; .scl 2; .type 32; .endef
_RgClassX:movq     $rclssx,%rax
          jmp      0f

          .p2align 4,,15
          .globl   _LdIcon
          .def     _LdIcon; .scl 2; .type 32; .endef
  _LdIcon:movq     $ldicon,%rax
          jmp      0f

          ...
I use symbolic names for all local labels, but you could use GCC-style labels like L00...Lxx as well. Symbolic names are faster to find if the file includes 60 functions, as cap.S does...

Functions

The functions themselves handle all required tasks, call the corresponding API function and pass the API return code (RC) to the caller:
          ...

          .p2align 4,,15
   rclass:call     *__imp__RegisterClassA(%rip)
          jmp XIT

          .p2align 4,,15
   rclssx:call     *__imp__RegisterClassExA(%rip)
          jmp XIT

          .p2align 4,,15
   ldicon:call     *__imp__LoadIconA(%rip)
          jmp XIT

          ...
The functions shown above just pass the received parameters in RCX, RDX, R08 and R09 to the API, but they may also pre-process those parameters before they finally call the API function. An example:
          ...

          .p2align 4,,15
   ctlshw:call     *__imp__GetDlgItem(%rip)
          movq     %rax,%rcx                 # HWND
          movq     0x98(%rsp),%rdx           # flag
          call     *__imp__ShowWindow(%rip)
          jmp XIT

          ...
A call to CtlSh(HWND, id, bool); shows or hides the control specified by its ID and the handle of the control's parent window (probably a dialog). To speed up execution, the wrapper function first retrieves the control's window handle, then calls the API function to show or hide general windows rather than calling a Windoze macro (which does nothing else than CtlSh(), but probably takes the long-winded way).

A Real Epilogue...

I hope I could impart some knowledge about the pros and cons of old fashioned C-style programming techniques and modern alternatives. The entire file cap.S can be downloaded here: wrappers.7z.

Sunday, November 14, 2010

De-mystifying Windoze UpDown-Controls

Have you ever asked yourself why things are that complicated? If so, you might want to read further...

Windoze UpDown-Controls

Well, microsoft should have called them DownDown-Controls, because they are a cheap imitation of OS/2's sophisticated Spinbutton Controls. When I saw microsoft's documentation for the very first time, I just skipped the implementation of my advanced spinbuttons after reading two or three pages. Meanwhile, I have started to recreate DatTools, because I urgently need a tool to manage my datafields. I surely could do it with a hex editor, but - it is quite time consuming to enter strings this way...

Being forced to dive deeper into the complicated matter of Windoze' controlled Ups and Downs, I started to insert some experimental code into my raw DatTools skeleton. After writing a tool to retrieve all received messages and put them into a dump file, I finally got a clue how these controls really work. As it turned out, there is absolutely nothing complicated about them - except the HLL hocus pocus which hides really simple stuff behind important sounding gobbledygook to keep normal people away from using these controls. Actually, Windoze' UpDown-Controls are less complicated than OS/2's Spinbuttons!

Umm ... this introduction is getting too long, let us start to learn something new. There are several ways to create an UpDown-Control. You can do it the really complicated way and use CreateUpDownControl() to create the control, but it is a mess to retrieve the proper x, y, w and h values for this function. The least complicated way to retrieve coarse parameters is to query the rectangle of the associated edit control, add a few pixels to (x + w) as the new x position, use the same y and h, and take h plus a few pixels as the w parameter. The easiest way (for really lazy sods like me...) is to add one (or several) line(s) like this
CONTROL "", 0x138B, "msctls_updown32", 0x50010044,  85,  18,  15,  12
to your whatever.dlg file. The hexadecimal 0x50010044 is a short version of
UDS_ALIGNRIGHT | UDS_HORZ | WS_CHILD | WS_VISIBLE | WS_TABSTOP
These styles are defined for UpDown-Controls:
UDS_WRAP                        0x0001
UDS_SETBUDDYINT                 0x0002
UDS_ALIGNRIGHT                  0x0004
UDS_ALIGNLEFT                   0x0008
UDS_AUTOBUDDY                   0x0010
UDS_ARROWKEYS                   0x0020
UDS_HORZ                        0x0040
UDS_NOTHOUSANDS                 0x0080
UDS_HOTTRACK                    0x0100
Having done that, we can leave HeLL and start with some real work. Everything begins with processing the WM_INITDIALOG message. In general, you might want to set the upper and lower limits via something like
xorl     %eax,   %eax
xorl     %r9d,   %r9d
movq     %rdi,   %rcx
movl     $0x138B,%edx
incl     %eax
movl     $0x046F,%r8d
decl     %r9d
movq     %rax,   0x20(%rsp)
call     _SnDIM
SnDIM() is a wrapper for SendDlgItemMessageA(), preserving and restoring the eight registers destroyed by Win's API. RDI is my fixed storage for the dialog's HWND - I used RCX previously to pass parameters to other functions not shown here. After setting the lower and upper limits, you might want to tell the UpDown-Control where to start:
incl     %r9d
addl     $0x02,%r8d
movq     %r9,  0x20(%rsp)
call     _SnDIM
As you can see, wrappers save reloading all destroyed registers after each API call. Take care to pass a zero in R09 (WPARAM) as well! The hexadecimal in RDX is the resource ID of the UpDown, the hexadecimal in R08 the message sent to the UpDown-Control. These are all messages you can send to an UpDown:
UDM_SETRANGE                    0x0465
UDM_GETRANGE                    0x0466
UDM_SETPOS                      0x0467
UDM_GETPOS                      0x0468
UDM_SETBUDDY                    0x0469
UDM_GETBUDDY                    0x046A
UDM_SETACCEL                    0x046B
UDM_GETACCEL                    0x046C
UDM_SETBASE                     0x046D
UDM_GETBASE                     0x046E
UDM_SETRANGE32                  0x046F
UDM_GETRANGE32                  0x0470
UDM_SETPOS32                    0x0471
UDM_GETPOS32                    0x0472
UDM_SETUNICODEFORMAT            0x2005
UDM_GETUNICODEFORMAT            0x2006
This is almost all of the complicated stuff to do. Oh yes, not to forget: the only thing we have to evaluate after initialising the dialog is the WM_NOTIFY message. Whenever RDX holds 0x4E, we should check if the low word of R08 (WPARAM) is the ID of one of our UpDown-Controls. If so, R09 (LPARAM) holds the address of a 32 byte wide stack location where the following parameters are parked:
00   DQ   hwndFrom   control HWND
08   DQ   idFrom     control ID
10   DQ   code       0xFFFFFD2E (UDN_DELTAPOS)
18   SD   iPos       current position
1C   SD   iDelta     proposed change
You can either use iPos directly or add iDelta to the current value of your own parameter. Converting the new value to a hexadecimal or decimal string, selecting an entry from a string table or whatever else you want to do with the returned result is trivial and might be discussed in another post. There are a few messages UpDown-Controls may send in the code quadword:
UDN_FIRST                   0xFFFFFD2F
UDN_LAST                    0xFFFFFD27
UDN_DELTAPOS                0xFFFFFD2E
Now that you have learned all important facts about UpDowns, it is time for some remarks.


Tips And Tricks

You might associate your UpDowns with the previous window (buddy window) in the dialog's window hierarchy, which probably is an edit control. If you want to implement some kind of true control over your controls, I strongly recommend not to associate your UpDowns with a buddy window. Let them work as simple controls with the ability to send WM_NOTIFY messages repeatedly as long as one of the arrow keys is pressed.

The most interesting detail is the iDelta parameter. As a matter of fact, this is the only useful thing an UpDown-Control emits. It keeps us informed about which of the buttons is currently held down. It is FFFFFFFF for DOWN and 00000001 for UP when the UpDown starts spinning, and might be increased if one of the arrow keys is held down for a while. You can control this behaviour by setting the ACCEL structures associated with the UpDown to values suiting your needs. There are at least three ACCEL structures for each UpDown-Control. With these few parameters, it is easy to create very complex structures controlling multiple 'slave' windows with a single UpDown-Control.

The Future Is Less Than A Picosecond Away

It will take more than a few picoseconds to port my advanced spinbuttons to Windoze, but, nevertheless - let me introduce ST-Open's Spinbutton Library. Like all ST-Open libraries, spinbuttons are controlled via datafields and provide an easy to use interface for assembler programmers as well as for C-style coders. All libraries follow standard C calling conventions. The entire programming interface consists of a single function, one common structure and a few line-filling definitions to keep C(+-*#?) programmers happy. What would we do if we were not forced to scroll the text on our 250 * 15,000 pixel screens sideways? Quite unbelievable that a single line of simple code was not sufficient to fill even 30,000 pixel wide screens, isn't it? But: such code definitely exists, see above!

Back to business. The old function expected four parameters
spin number
SPN_* command
numeric input    (optional)
address in/out   (optional)
where command was defined as one of these:
SPN_SET             0x08
SPN_GETCUR          0x07
SPN_GETID           0x06
SPN_GETSTRUC        0x05
SPN_QUERY           0x04
SPN_END             0x03
SPN_DN              0x02
SPN_UP              0x01
SPN_INIT            0x00
As mentioned above, the UpDown delivers all necessary parameters without lengthy requests to API functions. You can pass R08 and R09 through as you got them for SPN_NOTIFY and SPN_EDITED, while you have to provide one parameter for the other commands which are reduced to
SPN_SETSTRUC        0x06
SPN_GETSTRUC        0x05
SPN_QUERY           0x04
SPN_SET             0x03
SPN_EDITED          0x02
SPN_NOTIFY          0x01
SPN_INIT            0x00
where SPN_GETSTRUC and SPN_SETSTRUC are only required for HLL freaks (real programmers know how to read memory blocks without contorted manoeuvres). The positive side effects of datafield driven spinbuttons:
  • Due to the advanced conceptual design, code is greatly reduced and most tasks are performed by the highly optimised library function.
  • The spinbutton datafield is loaded automatically with the first SPN_INIT command if it is not present, yet.
  • All spinbuttons automatically start with the values of the last session without any line of extra code.
  • You have to set minimum and maximum values for each spinbutton only once.
  • Changing parameters, including the spinbutton type, is comfortably done with DatTools' spinbutton editor.
That much about the bread and butter side of ST-Open's Spinbuttons. Now the honey pot - the currently available spinbutton types:
SPN_STR             0x08
SPN_DATE            0x07
SPN_TIME            0x06
SPN_HEX64           0x05
SPN_HEX32           0x04
SPN_HEX16           0x03
SPN_HEX08           0x02
SPN_DEC64           0x01
SPN_DEC32           0x00
Are there any wishes left? Oh well, there are no floating point spinbuttons yet. As long as there are no FP conversions in my libraries, there are no FP spinbuttons. I never will code one more wrapper to call functions of a C library. Keep dirty programming where it belongs (e.g. microsoft...). Full stop.

Final Words

ST-Open's Libraries Version 8.0.0 (64 bit Win) will be available in a few months. Once all libraries are tested, they will be uploaded to Google Code and I finally can start the development of IDEOS, the 21st century's operating system.

Addendum

Corrected some errors caused by mixed use of decimals and hexadecimals in some header files...

Monday, November 1, 2010

Intelligent Design in one piece

Copyright Note

The programming techniques introduced in this paper are intellectual property of Bernhard Schornak. They are protected by international copyrights, published under the terms of the FT4FP-License. Any commercial use, trade or other forms of exploitation to gain profit are strictly prohibited. Knowledge is a common property and should be freely available for every human. It must not be abused as proprietary ware, only available for those who can afford to feed a few greedy individuals with money.

This document was written for the ST-Open homepage in 2006. It was slightly modified for this blog.


Introduction

Most people probably associate the term Intelligent Design with the movement of Creationism rather than with a new, revolutionary programming technique. The usurpation of this term is an intended sideswipe. Whatever invented gods and goddesses were able to do - a smart programmer can do it much better. This paper is an introduction to the next generation of programming, superior to old fashioned conventions and programming techniques.

Compared against conventional programming techniques, Intelligent Design represents a quality leap. However, some knowledge about the creation and management of a conventional stack is required to understand the important difference between old fashioned programming techniques and Intelligent Design. To impart the knowledge about conventional methods to the reader, the next pages offer a detailed introduction to stacks, stack frames and how they are managed. Without this knowledge, it probably is impossible to understand the alternative methods and techniques introduced with Intelligent Design. Old fashioned programming techniques never kept pace with recent processors - the standards of so-called high level languages are designed to work with the first generation of microprocessors as well as with the most recent quad-core machines. Unfortunately, the computational power of processors grew by several powers of ten, while software standards never followed any technical evolution. Today, we have mature high speed processors driven by never-grown software toddlers. It is quite counter-productive to slow down high speed devices because their 'drivers' cannot handle most of the controls.


Basics

The assembly language dialect used in this paper is not the one commonly used in the world of Windows, also known as LETNi syntax. It is AT&T syntax, known in the world of Linux and Unix. When I began to write code for the x86 platform in 1993, GCC was the only free development tool one could get, so I had to use AT&T syntax. If you only know LETNi syntax, there is a short introduction to AT&T syntax to learn the difference between the two. Once you have worked with AS for a short time, you won't want to return to the complicated and perverse (or was that reversed?) LETNi syntax anymore. The programming techniques introduced in this paper do not rely on a specific syntax. However, knowing AT&T syntax might help you to understand the sample code. Reaching the goal is what really counts. How we get there is another, slightly different problem.

The Stack

Every x86 processor works as a stack machine. Parameters we have to pass to called functions, return addresses to calling functions, local variables and structures are put onto or read from the stack. Unfortunately, most of today's programmers are just used to clicking together some of those prefabricated code fragments coming along with the daily 20 GB version upgrade for their favourite VisualXYZ(plus-minus-dotnet) development suite. If you ask any of them 'What is a stack?', they probably tell you something about hay or - of course! - money. Sad, but true. Exceptionally sad, because the mechanisms of a software stack are so simple, one is tempted to call the idea behind them nothing less than 'brilliant'.

Whenever you compile a program, you pass a definition file with the extension .def to the linker (LINK.EXE or similar). The definition file holds some important information about the program for the session manager of your operating system. When the compiled program is started, the operating system reserves three independent memory blocks (code, data, stack) for the new process, and the segment registers CS, DS+ES and SS are set to the addresses of these blocks. Whatever you defined as STACKSIZE in the definition file, exactly that size is allocated for your stack segment. After allocating the required memory blocks for those three segments, the program code is copied to the code segment, all defined global variables are written to the data segment - they are 'initialised' - and rSP is set to the top of the stack segment. Finally, the address of the array with the command line parameters and the argument count are pushed onto the virgin stack before the session manager calls the main() function of our program. Entering main(), the processor starts to execute the code found there, until it stumbles upon the final ret instruction and passes control back to the session manager.

Entering our program's main() function, the stack looks like this:

    +------------------------+
    | argv[]   (parameter 2) |
    +------------------------+
    | argc     (parameter 1) |
    +------------------------+
    | return address         |  <- ESP
    +------------------------+

The top stack element holds the argument vector argv[] (parameter 2), the next lower stack element holds the argument count argc (parameter 1). The current stack element, the one ESP points to, holds the return address to the session manager. Whenever the program is terminated, the instruction pointer is loaded with this address and the terminating sequence of the session manager is executed. It first frees those resources our program may have reserved for itself, e.g. allocated memory blocks, open files, open devices, and so on. Finally, it releases the allocated segments and cleans up all structures holding control data of our program.

In the running program, the content of the stack pointer is decreased with every push instruction, the creation of a stack frame or the call to another function. It is increased whenever we pop data from the stack, release a stack frame or return to a calling function. In 32 bit functions, four byte are subtracted from ESP with every push or call, while four byte are added to ESP with every pop or ret. All subtractions or additions are done by the processor automatically, because they are an integral part of the mentioned instructions. Using a picturesque language, we might state: The stack grows with every push or call and shrinks with every pop or ret.

Using The Stack

To make sensible use of the stack, any x86 processor provides the instructions call, enter, leave, pop, push and ret. All these instructions update the content of ESP automatically. Besides these special instructions, ESP can be used like any other register, as well. We are free to add or subtract immediate values or the content of another register to/from the stack pointer and do other funny things with ESP. However, the most utilised action probably is the subtraction of an immediate value from ESP to create a stack frame and the addition of exactly the same value to release that stack frame if we do not need it any longer. A detailed description follows later on, see About Stackframes. For now, we focus our attention on some more basic things like the instructions mentioned above. To understand the concept of conventional programming methods, it is very important to know how these instructions work and how they manipulate the stack and ESP.

Call

If the processor encounters a call instruction, it first subtracts two, four or eight (depending on the standard datasize) from ESP, then stores the address of the instruction following the call on the stack. Next, the address passed as a part of the call instruction is loaded into the instruction pointer. Execution now is continued with the instructions found at the new location, until the processor stumbles upon a ret.
...
          call DoNothing  # call function DoNothing
          xorl %eax,%eax  # <- the address of this instruction
          ...             # is stored on the stack and loaded
          ...             # into EIP with the RET instruction
DoNothing:                # local function DoNothing
          ret             # return to caller...

Enter

Please do not use this vector path instruction - it blocks the processor for 13 entire clock cycles. The usual replacement
...
pushl %ebp       # save EBP
movl %esp,%ebp   # save ESP
subl $0x10,%esp  # create stack frame
...
is executed in four clock cycles. Saving nine clock cycles makes this sequence more than three times as fast, so you should prefer the replacement code over enter under any circumstances! The 16 byte are just a randomly chosen example to keep the sample code valid. The real size you have to subtract from ESP depends on the amount of local variables and other temporary data your function has to store on the stack.

Leave

The leave instruction is equivalent to the following two instructions:
...
movl %ebp,%esp   # restore ESP (DP 1, 2 byte)
popl %ebp        # restore EBP (VP 4, 1 byte)
...
Like pop, leave is a vector path instruction. Because pop blocks the processor for four clock cycles and also has to wait for the valid result of the preceding mov instruction, leave actually is two clock cycles faster. Moreover, the one byte leave is shorter than its three byte replacement. Hence, you should prefer leave over the alternative method.

Pop




Pop copies the content of the current stack element to a register or memory location, then adds, depending on the processor mode and an optional prefix, two, four or eight to the content of the stack pointer. Unlike the direct path push, pop is a vector path instruction. It is executed in pipes 0 and 1, while pipe 2 is blocked for the time the instruction is processed. Every pop instruction lasts four clock cycles. Pop is a special vector path mov instruction, where ESP automatically is updated after it was used to address the source of a copy operation. In general, pop is used to restore the content of a register. You should keep track of the stack pointer, because it is quite difficult to find errors caused by asymmetrically executed push and pop instructions - especially if some of them are inside a loop while others are not.
...
popl %ebx        # restore EBX [=> ESP + 4(!)]
...

Push




Push first subtracts, depending on the processor mode and an optional prefix, two, four or eight from the stack pointer, then copies an immediate value, the content of a register or the content of a memory location into the stack element the updated ESP points to. Push is a special direct path mov instruction, where ESP automatically is updated before it is used to address the target of a copy operation. While the entire execution time of each push instruction is 3 clock cycles, ESP is available one clock earlier (after 2 cycles) for the following instructions.

In general, push is used to put parameters or register contents onto the stack. It is a good idea to remove 'used' parameters from the stack as soon as possible, because they decrease the available stack size, cause avoidable stack pointer arithmetics, and so on. To remove them, we do not have to pop them - beware! - from the stack. We just have to add the appropriate amount of byte to the stack pointer. If we pushed two parameters in a 32 bit function as shown in the example below, we add 8 byte to ESP. If it were eight parameters, we would have to add 8 * 4 = 32 byte to ESP, and so on.

Regardless of the lazy behavior of HLL (high level language) compilers, it is a good idea to correct the content of ESP directly after each call instruction if you passed parameters to the called function. Humans have trouble keeping track of the current content of the stack pointer, especially if the function body exceeds the size of their display. Compilers can do that much better. However, their output is less optimised than code written by a human.
...
pushl %eax       # put parameter 2 onto the stack
pushl %ebx       # put parameter 1 onto the stack
call _helpling   # call another function
addl $0x08,%esp  # correct ESP directly after CALL
...

Ret(urn)

Whenever the processor stumbles upon a ret instruction, it copies the address stored in the current stack element into the instruction pointer EIP, then adds the standard datasize (2, 4 or 8 bytes) to the stack pointer. After ESP is updated, execution continues with the instruction found at the address EIP now points to.
...
ret              # return to caller
...
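Conceptually, ret is a pop into the instruction pointer. Because EIP cannot be addressed directly, the following sketch (clobbering ECX purely for illustration) shows the equivalent operation:
...
movl (%esp),%ecx # fetch return address from current stack element
addl $0x04,%esp  # release the stack element (32 bit mode)
jmp *%ecx        # continue execution at the return address
...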

Stack Management

As mentioned in the descriptions of the single instructions, keeping track of the current content of the stack pointer is a must of highest priority. Because parameters generally are passed with push instructions, while they are removed from the stack by adding the appropriate number of bytes to the stack pointer ESP, it is very important to handle these operations with exceptional care. Especially the necessary corrections of the stack pointer bear some potential for errors. Unfortunately, such errors cause unexpected crashes and malfunctions, and it is quite hard to track them down until the culprit causing the mess is found. This reaches a new dimension with the creation of a stack frame. What is this already mentioned sinister stack frame?


Stack Frames

When the session manager passes control to our main() function, we surely create a message queue and a frame window, but we also might play some music or those mega-cool animations which are a must for all recent applications swimming along the main stream. Whatever programs might do when they get control over the computer, the first lines of any function, including main(), always follow the set routine:

First: A stack frame is built.
Second: All used registers are saved on the stack. Actually, ECX and EDX never are saved, because the C-conventions say so (see Conventions).
Third: The code found in the function body performs all tasks the function is coded for.
Fourth: All saved registers are restored.
Fifth: The stack frame is destroyed (released).
Sixth: The final ret is executed and the function returns to its caller.

One thing should be mentioned explicitly: We only need a stack frame if we have to store local variables or other temporary data structures on the stack. Functions without local variables do not need a stack frame at all. Building a stack frame takes at least 6 clock cycles and occupies 10 bytes of code. To avoid this superfluous activity, you should switch on the -fomit-frame-pointer option (GCC) by default. It skips building stack frames where they are not required.

Grasping how a base pointer is used and how it works is the key to understanding conventional programming techniques. Because this is a very important issue, all explanations are very detailed. Some may find them too lengthy, but we should give others the chance to get in touch with all aspects, so (hopefully) anyone is able to gather the knowledge required to apply the learned stuff in real life.
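Put into code, the set routine looks like this minimal sketch (the function name, the single local variable and the preserved register are illustrative):
...
_func:  pushl %ebp       # 1st: build stack frame...
        movl  %esp,%ebp
        subl  $0x04,%esp #      ...with one local variable
        pushl %ebx       # 2nd: save used registers
        ...              # 3rd: function body
        popl  %ebx       # 4th: restore saved registers
        leave            # 5th: destroy stack frame
        ret              # 6th: return to caller
...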

Example

The following example shows how conventional stack frames are created. It is trivial code used in every program. Even huge monster applications like operating systems, browsers, audio studios, et cetera use the same pattern. They only differ in the size of their stack frames.
pushl %ebp               # save old base pointer
movl %esp,%ebp           # load new base pointer
subl $0x08,%esp          # reserve 2 local variables
pushl %ebx               # save EBX
movl 0x08(%ebp),%eax     # EAX = argument count
movl 0x0C(%ebp),%ebx     # EBX = argument vector
movl $0x00,-0x04(%ebp)   # local variable 1 =  0
movl $0x20,-0x08(%ebp)   # local variable 2 = 32
pushl %eax               # copy EAX to -0x10(%ebp)
pushl %ebx               # copy EBX to -0x14(%ebp)
...

The Stack

After execution of the above instructions, the stack looks like this:

 0x0C(%ebp)  parameter 2 (argument vector)
 0x08(%ebp)  parameter 1 (argument count)
 0x04(%ebp)  return address
 0x00(%ebp)  old base pointer    <- EBP
-0x04(%ebp)  local variable 1
-0x08(%ebp)  local variable 2
-0x0C(%ebp)  saved EBX
-0x10(%ebp)  copy of EAX
-0x14(%ebp)  copy of EBX         <- ESP

ESP always points to the current stack element. In our case, this is the address where the content of EBX was copied to. Depending on the instructions in the function body, ESP moves down towards stack bottom or back towards stack top. In properly designed functions, the content of ESP never can be greater than or equal to the content of EBP. This tells us that the base pointer not only is used to address stack elements. It also marks the border between the current stack frame and the stack frame of the calling function.

pushl %ebp

The first step to create a new stack frame is saving the content of the old base pointer of the calling function on the stack. The content of the old base pointer must not be changed under any circumstances! Otherwise, the calling function uses an invalid basis to address its local variables. It will crash if it tries to read or write data to an address stored in a local variable, because that address, formerly stored at -0x0C[EBP], now might be found at 0x3C[EBP]. At the latest, the calling function commits suicide while trying to return to its caller. It fills the base pointer with random data, then loads other random data into the instruction pointer. The attempt to execute that 'code' raises one of those exceptions and an 'access violation' is reported on the user's screen.

movl %esp,%ebp

In the second step, we copy the address of the current stack element to EBP. The new base pointer now contains the address where the content of the old base pointer is stored. The next stack element below the new base pointer is the topmost stack element we are allowed to write to. All stack elements above EBP - including 0x00[EBP]! - should only be read from, but never written to! From here on, we use the base pointer to address local variables and parameters passed to our function. In other words, these data are addressed with offsets to the 'frozen' base pointer. As shown in figure 03, the old base pointer is stored at offset zero to the current base pointer. We write this as 0x00[EBP], where the 0x tells us this is a hexadecimal number.

Everything related to programming should be written in hexadecimal notation. It is much easier to read, numbers are shorter and always formatted correctly. For example, 0xF100 tells us at first sight: 'It's the second page in memory block 0xF000.' Its decimal equivalent 61696 does not tell us anything. We have to start a calculator to translate it into 0xF100, wasting precious time.

Back to the base pointer. Passed parameters and the return address to the calling function are stored in stack elements above the one the base pointer points to. Therefore, they are addressed with positive offsets. Our return address is stored at 0x04[EBP], followed by one or more optional parameters, where parameter 1 always is stored at 0x08[EBP], parameter 2 at 0x0C[EBP], and so on. The 'top down' order is caused by a C-convention saying that all parameters are pushed onto the stack back to front, starting with the last parameter the compiler finds inside those round brackets following the name of the called function.

subl $0x08,%esp

The real creation of a stack frame is done with the third and last step. First, we calculate the required size for all variables and structures, then we subtract the result from ESP. Of course, the result must be rounded up to the next multiple of the standard datasize before we subtract it from the stack pointer. If we subtracted an 'odd' number, the stack would be corrupted and no longer valid. Everything addressed via ESP, e.g. by call or push, would be misaligned, causing a lot of penalty cycles. If ESP accidentally were set to 0xFFE2 instead of 0xFFE4 in our sample code (e.g. by subtracting 0x0A rather than the proper value 0x08), and we then pushed two local variables 0x89ABCDEF and 0x01234567 onto the stack and copied variable 2 from -0x08[EBP] to EBX, EBX would finally contain the invalid number 0xCDEF0123 - the doubleword currently stored at address 0xFFE4.

With this subtraction, we reserve the subtracted number of bytes on the stack. This reserved area is safe from being overwritten by any following call or push instruction, because ESP always is set to the next lower stack element before a write operation. Because all local variables lie below the stack element where the old base pointer is stored, they are addressed with negative offsets. The first is stored at address -0x04[EBP], the 2nd at -0x08[EBP], and so on. It neither matters how you name your local variables nor whether you count them from top to bottom or vice versa. The only important thing is to remember where each of them is stored.
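If a frame size ever had to be computed at runtime, rounding it up to the next multiple of the standard datasize could be sketched like this (purely illustrative - frame sizes normally are constants known at assembly time):
...
addl $0x03,%eax  # raw size in EAX
andl $-0x04,%eax # round up to the next multiple of 4
subl %eax,%esp   # reserve the aligned area
...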

pushl %ebx

Saves the content of EBX on the stack. Following the C-conventions, the content of EBX, EDI and ESI must be saved before overwriting them and must be restored before returning to the calling function. In other words: The content of these registers must be preserved by all functions (see Conventions).

Urgent Needs

In conventional code, the content of EBP must never be changed. If you cannot avoid using EBP for general purposes because you urgently need an extra register, you have to save it before using it. You should restore EBP immediately after passing the bottleneck. Not worth mentioning, but, nevertheless: EBP cannot be used to address local variables while it holds your private data...

The Function Body

ESP always points to the address of the current stack element. Looking at our example, it is quite obvious that we are going to call another function. Our local variable 1 might be a counter which is incremented each time the called function returns TRUE. Local variable 2 might be a loop counter, so the function might be called 32 times, counting how many times a specific condition is met.
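Such a function body could be sketched as follows (the called function _helpling and the flow are assumptions for illustration):
...
0:      pushl %eax       # parameter for the called function
        call  _helpling
        addl  $0x04,%esp # remove the parameter
        testl %eax,%eax  # returned TRUE?
        jz    1f
        incl  -0x04(%ebp) # local variable 1: count hits
1:      decl  -0x08(%ebp) # local variable 2: loop counter
        jnz   0b
...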

Stack Frame Destruction

When all instructions in the function body are executed, we first have to restore the saved registers. Next, we have to destroy the stack frame to release the area we reserved for our private use. Finally, we return to the calling function. There are two possible ways to perform these tasks:
...
addl $0x08,%esp   # correction after 2 PUSH instructions
popl %ebx         # restore EBX
leave             # destroy stack frame (VP 3, 1 byte)
ret               # return to caller
or
...
addl $0x08,%esp   # correction after 2 PUSH instructions
popl %ebx         # restore EBX
movl %ebp,%esp    # restore ESP (DP 1, 2 byte)
popl %ebp         # restore EBP (VP 4, 1 byte)
ret               # return to caller
Both variants restore ESP and EBP. However, leave is two clock cycles faster and two bytes smaller. The correction of ESP is required to restore the preserved content of EBX. Because we pushed two parameters onto the stack after we pushed EBX, ESP is eight bytes below the address where the content of EBX is stored. To pop the proper content into EBX, we have to add these eight bytes to ESP. If no registers were saved on the stack, no correction of ESP is required, and leave and ret are the only instructions we need to restore ESP and EBP before we return to the caller.

Return To Caller

In our example, ret copies the return address - which points back into the session manager of the operating system - to EIP, and execution continues there. The session manager closes our program, frees still allocated resources and passes the value we returned in EAX to the command interpreter.

What Did We Learn?

All introduced mechanisms are, except the size of the stack frame and the number of preserved registers, identical for all functions following the C-conventions. As far as I know, all of the known operating systems follow these conventions. The advantages are obvious. Functions, once coded, are portable from one version to the next or from one operating system to another. In the latter case, minor changes must be applied if functions of the target OS have other names or expect parameters in a different order. Putting it all together, conventions simplify re-use and porting of existing code. But - every object in our universe has at least two (or more) sides. The disadvantages of the introduced methods, mechanisms and conventions are not as obvious as the advantages, because they are hidden in the deepest and darkest parts of the machine room. Only experienced, well trained technicians are able to find out why the machine chokes and does not run as fast as expected. Time to analyse what C-conventions do on assembly language level.


Caveats

As mentioned before, conventions, methods and techniques introduced with the programming language C have a lot of advantages over the monolithic code we used to write applications for DOS and other ancient operating systems. All functions easily are portable to other operating systems or processor architectures and can be used multiple times, leading to versatile applications for multiple platforms. Unfortunately, after the C conventions were established, they never were updated or revised to keep pace with the development of processor architectures. If we compare an 8086 against a recent quad-core Athlon, a picturesque comparison would be a cart dragged by a tired ox versus an Airbus A380. While no person of sane mind ever wasted one thought on equipping an A380 with a harness and motivating it to lift off with reins, whip and loud 'hee' and 'ho' shouts, software designers all over the world practise such weird things every day. They still create 'flying' stack frames, use slow leave, pop and push instructions with the obligatory update of the stack pointer instead of the faster mov, and continue to abuse valuable resources by using EBP as base pointer. This reduces the too small register set of the x86 architecture by 1/7th, forcing the programmer to use slow memory reads and writes instead of much faster register operations.

To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. DLGtxt() sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.
.text
        .p2align 4,,15
        .globl   MoveDlg
MoveDlg:pushl    %ebp
        movl     %esp, %ebp
        pushl    %esi
        pushl    %ebx
        movl     12(%ebp), %edx
        movl     8(%ebp), %esi
        movl     16(%ebp), %ecx
        movl     20(%ebp), %ebx
        cmpl     $48, %edx
        je       L12
        ja       L39
        cmpl     $32, %edx
        je       L4

    L37:movl     %ebx, 20(%ebp)
        movl     %ecx, 16(%ebp)
        movl     %edx, 12(%ebp)
        movl     %esi, 8(%ebp)
        leal     -8(%ebp), %esp
        popl     %ebx
        popl     %esi
        popl     %ebp
        jmp      _DefDP
Compared against GCC 2.6.1. (1993), GCC 3.3.5. (2006) generates much worse code, spiced with a lot of counterproductive, superfluous instructions. Writing back parameters to the stack violates some basic rules. First, there is absolutely no need to write data back to the locations we took them from a few instructions earlier. Secondly, writes to stack locations above ESP violate all rules of proper programming. If any function starts to write data to the stack frames of other functions, it is just a question of time until we encounter disastrous malfunctions.
.p2align 4,,7
     L4:movl     %ecx, %eax
        andl     $65535, %eax
        cmpl     $4658,  %eax
        je       L7
        jg       L11
        cmpl     $4657, %eax
        je       L6
Because we only need the low word stored in message parameter 1, it would be a better idea to extract this low word with the single instruction movzwl 0x10(%ebp),%ecx rather than to waste valuable clock cycles with the code sequence shown above. Recent processors have three execution pipes, not just one. The chosen way sends two execution pipes to sleep while one pipe is busy extracting data from a register, and this is repeated twice. We could switch off two execution pipes while this code is executed, because we created two avoidable dependencies.
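Following that suggestion, the extraction shrinks to a single load-and-extend (a sketch - the subsequent comparisons would then test ECX instead of EAX):
...
movzwl   0x10(%ebp),%ecx # ECX = low word of message parameter 1
cmpl     $0x1232,%ecx    # 4658
...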
L9:movl     %ebx, 20(%ebp)
        movl     %ecx, 16(%ebp)
        movl     %edx, 12(%ebp)
        movl     %esi, 8(%ebp)
        leal     -8(%ebp), %esp
        popl     %ebx
        popl     %esi
        popl     %ebp
        jmp      _DefDP
Obviously, L37 and L9 provide identical code. Using our brains, these eighteen redundant lines could be reduced to five really necessary instructions.
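The five really necessary instructions would form one shared exit, reached from both branch points (a sketch; DefDP() reads its parameters from the untouched stack copies, so the four write-backs can be dropped):
...
L37:    leal     -8(%ebp), %esp # release local stack area
        popl     %ebx           # restore preserved registers
        popl     %esi
        popl     %ebp
        jmp      _DefDP         # let the default procedure finish
...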
L6:movl     _GVAR, %eax        # [1]
        subl     $12, %esp
        movl     $0, 10464(%eax)

    L42:pushl    %esi
        call     _WinDD

    L41:addl     $16, %esp
        .p2align 4,,7
Up to seven nops are executed every time we branch to this part of the code - a good way to slow down the execution flow as much as possible.
L2:leal     -8(%ebp), %esp
        xorl     %eax, %eax
        popl     %ebx
        popl     %esi
        popl     %ebp
        ret

    L11:cmpl     $4659, %eax
        jne      L9
        subl     $12, %esp
        pushl    $17
        call     _Help
        jmp      L41
        
     L7:pushl    %edx
        pushl    %edx
        pushl    $18
        movl     _GVAR, %eax        # [1]
        addl     $10464, %eax
        pushl    %eax
        call     _FlgS
        popl     %eax
        jmp      L42
The branch prediction logic assumes every first branch as false if there is no entry in its internal table. Almost all of the above branches trigger the obligatory ten penalty cycles, because the wrong branch instructions were chosen.
.p2align 4,,7
    L39:cmpl    $59, %edx
        jne      L37
        pushl    $233
        pushl    $211
        pushl    $210
        pushl    %esi
        call     _DLGtxt
        movl     $0, (%esp)
        pushl    $-1
        pushl    $288
        pushl    $4672
        pushl    %esi
        call     _SnDIM
        addl     $20, %esp
        pushl    $0
        pushl    $-1
        pushl    $288
        pushl    $4680
        pushl    %esi
        call     _SnDIM
        addl     $20, %esp
        pushl    $0
        pushl    $-1
        pushl    $288
        pushl    $4688
        pushl    %esi
        call     _SnDIM
        addl     $20, %esp
        pushl    $0
        pushl    $-1
        pushl    $288
        pushl    $4696
        pushl    %esi
        call     _SnDIM
        addl     $24, %esp
Only the second parameter changes for the four consecutive calls of SnDIM(). Twelve of these twenty push instructions (60 percent) are redundant.
pushl    %esi
        pushl    $0
        pushl    $2
        pushl    $0
        pushl    $11
        movl     _GVAR, %eax        # [1]
        movl     7252(%eax), %eax
        pushl    %eax
        call     _FDacc
        movl     _GVAR, %eax        # [1]
        addl     $24, %esp
        movl     472(%eax), %ebx
        pushl    %ebx
        pushl    $0
        pushl    $2
        pushl    $4
        pushl    $11
        movl     7252(%eax), %ecx
        pushl    %ecx
        call     _FDacc
Only parameters 3 and 6 change between the two calls of FDacc(). While we are bound to push instructions, there is no way to change just these parameters. We have to push all six parameters again, because the parameter on top must be changed.
addl     $20, %esp
        movl     _GVAR, %eax        # [1]
        movl     $0, 10464(%eax)
        pushl    %esi
        call     _CtrWn
        movl     %esi, (%esp)
        call     _DlgShow
        jmp      L41

        .p2align 4,,7
    L12:movl     %ecx, %eax
        andl     $65535, %eax
        subl     $4672, %eax
        cmpl     $30, %eax
        ja       L37
        jmp      *L36(,%eax,4)
Again, it would have been better to extract the low word via a single movzwl 0x10(%ebp),%ecx rather than to use the chosen way. Probably, parallel execution was considered to be too fast?
.p2align 2
        .align   2,0xcc
I don't know what the second .align is good for. Any hints? My jump table is too large, because C programmers do not think about side effects like blowing up code while assigning resource IDs to 'straight' numbers. Due to the gaps between those IDs, there are a lot of superfluous entries in this jump table.
L36:.long    L14
        .long    L15
        .long    L16
        .long    L17
        .long    L2
        .long    L37
        .long    L37
        .long    L37
        .long    L18
        .long    L19
        .long    L37
        .long    L37
        .long    L37
        .long    L37
        .long    L37
        .long    L37
        .long    L21
        .long    L23
        .long    L25
        .long    L27
        .long    L29
        .long    L31
        .long    L33
        .long    L37
        .long    L21
        .long    L23
        .long    L25
        .long    L27
        .long    L29
        .long    L31
        .long    L33
By the way: Jump tables belong to the .data, not to the .code segment. AS supports jump tables in the .data segment, so there's no need to violate the recommendations of AMD and LETNi as all versions of GCC do...
L14:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andb     $225, %ah
        orb      $16, %ah
        
        .p2align 4,,7
    L40:movl     %eax, 10464(%edx)
        jmp      L2
        
    L15:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andb     $225, %ah
        orb      $8, %ah
        jmp      L40
        
    L16:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andb     $225, %ah
        orb      $4, %ah
        jmp      L40
        
    L17:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andb     $225, %ah
        orb      $2, %ah
        jmp      L40
        
    L18:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-385, %eax
        orb      $1, %ah
        jmp      L40
        
    L19:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-385, %eax
        orb      $-128, %al
        jmp      L40
        
    L21:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-128, %eax
        orl      $64, %eax
        jmp      L40
        
    L23:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-384, %eax
        orl      $32, %eax
        jmp      L40
        
    L25:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-384, %eax
        orl      $16, %eax
        jmp      L40
        
    L27:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-384, %eax
        orl      $8, %eax
        jmp      L40
        
    L29:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-384, %eax
        orl      $4, %eax
        jmp      L40

    L31:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-384, %eax
        orl      $2, %eax
        jmp      L40
        
    L33:movl     _GVAR, %edx        # [1], [2]
        movl     10464(%edx), %eax
        andl     $-384, %eax
        orl      $1, %eax
        jmp      L40
Many redundant instructions unnecessarily blow up code size. The first two lines of all jump targets could be reduced to a common read before the distributor branches to the selected table entry.
.comm   _hab,       16
         .comm   _DEBUG,     16
         .comm   _USE_LDF,   16
         .comm   _LDR_AVAIL, 16
         .comm   _MSGLD,     16
         .comm   _BMM,       16
         .comm   _BNR,       16
         .comm   _GVAR,      16
         .comm   _BST,       16
         .comm   _BBF,       16
         .comm   _TST,       16
         .comm   _MHSTR,     16
         .comm   _LDF,       16
         .comm   _DUMPLINE,  16
         .comm   _DUMPCNT,   16
         .comm   _OLH_MODE,  16
         .comm   _SEC,       16
         .comm   _XXX,       16
         .comm   _FLD_XXX,   16
         .comm   _FLD_SEC,   16
We only use _GVAR in this file - enumerating all global variables is quite stupid. All globals are defined as doublewords (32 bit), so we might ask why GCC expands all these variables to a size of 128 bit, filling up the .data segment with garbage.

Footnotes

[1] This is the typical side effect of the C convention saying 'Thou shalt not save ECX and EDX'. Using two of six registers as a sanitary fill for temporary data, we reduce the set of safe registers to three, forcing us to reload frequently required parameters from memory over and over again. Reloading parameters more than two times eats up the advantage of omitting the two push and pop instructions to save and restore the content of ECX and EDX. After the third reload operation, we are on the losing side. Adding unnecessary loads of parameters to our code slows down our functions and does not speed them up.

[2] Loading _GVAR into EDX and the flags MVP_FLAGS into EAX could be done in L12. Applying some brain would not just save 24 lines of code, it would also speed up the 13 target functions, because loading and evaluating MVP_FLAGS would be separated and the dependency chain reduced to two single rather than three consecutive dependencies.


Improvements

Obviously, the code generated by GCC 3.3.5. is anything but optimised. Even if you do not know anything about reading source code, you surely are able to grasp what those comments say. This document is not an introduction to programming, so you have to rely on my words, but you can be sure: I definitely know what I am talking about. Rearranging some parts and reducing code sequences to the really required instructions shrinks GCC's draft markedly. With some human brain applied, the remaining (optimised) code should run about 30 percent faster.
.data
         .p2align 4,0x00
     jt0:.long    L02
         .long    L03
         .long    L04
         .long    L05
         .long    L16
         .long    L15
         .long    L15
         .long    L15
         .long    L06
         .long    L07
         .long    L15
         .long    L15
         .long    L15
         .long    L15
         .long    L15
         .long    L15
         .long    L08
         .long    L09
         .long    L10
         .long    L11
         .long    L12
         .long    L13
         .long    L14
         .long    L15
         .long    L08
         .long    L09
         .long    L10
         .long    L11
         .long    L12
         .long    L13
         .long    L14
Following recommendations of AMD and LETNi, the jump table is moved to the .data segment. This is much better than mixing code and data in the .code segment.
.text
         .p2align 4,,7
         .globl   MoveDlg
 MoveDlg:pushl    %ebp
         movl     %esp,%ebp
         pushl    %edi
         pushl    %esi
         movl     0x08(%ebp),%edi
         movl     0x0C(%ebp),%eax
         movzwl   0x10(%ebp),%ecx
         movl     _GVAR,%esi
         cmpl     $0x30,%eax
         je       L01
         cmpl     $0x20,%eax
         je       L00
         cmpl     $0x3B,%eax
         jne      L15
The distributor was optimised for the branch prediction logic. WM_CONTROL was put on top of the distributor, because most sent messages are WM_CONTROL messages. WM_COMMAND only is sent if the user pushes a button. No user is able to recognise delays of about 5 ns, so we can live with a ten cycle penalty if a branch target is mispredicted. WM_INITDLG is sent only once. Because the 1st comparison is 'guessed' as not taken, the branch does not trigger a penalty. The 2nd and all following comparisons are assumed to be taken, so the branch to the default routine (DefDP()) does not trigger penalties, either.
pushl    $0xE9
         pushl    $0xD3
         pushl    $0xD2
         pushl    %edi
         call     _DLGtxt
         pushl    $0x00
         pushl    $-0x01
         pushl    $0x0120
         pushl    $0x1240
         pushl    %edi
         call     _SnDIM
         addl     $0x08,%esp
         pushl    $0x1248
         pushl    %edi
         call     _SnDIM
         addl     $0x08,%esp
         pushl    $0x1250
         pushl    %edi
         call     _SnDIM
         addl     $0x08,%esp
         pushl    $0x1258
         pushl    %edi
         call     _SnDIM
The three parameters on top are pushed for the first call, only. This saves nine redundant instructions.
addl     $0x14,%esp
         movl     0x1C54(%esi),%ecx
         movl     0x01D8(%esi),%edx
Both parameters can be preloaded at this point, because FDacc() is a function taken from ST-Open's library. Functions in my libraries restore all registers (including ECX and EDX) by default - they are 'clean'. But - watch out: MoveDlg() is a function following the C conventions. ECX and EDX neither are saved nor restored - MoveDlg() is a 'dirty' function.
pushl    %edi
         pushl    $0x00
         pushl    $0x02
         pushl    $0x00
         pushl    $0x0B
         pushl    %ecx
         call     _FDacc
         pushl    %edx
         pushl    $0x00
         pushl    $0x02
         pushl    $0x04
         pushl    $0x0B
         pushl    %ecx
         call     _FDacc
         addl     $0x40,%esp
         movl     $0x00,0x28E0(%esi)
         pushl    %edi
         call     _CtrWn
         call     _DlgShow
         jmp      3f
The next distributor was optimised for code reduction. Because none of the three buttons is pushed more than once (in general..), we can live with a ten cycles penalty for one or two mispredicted branch(es). The delay added by the penalty is at least six powers of ten faster than anything human senses could perceive.
L00:subl     $0x1231,%ecx
         je       0f
         decl     %ecx
         je       1f
         decl     %ecx
         jne      L15
         pushl    $0x11
         call     _Help
         jmp      3f
       0:movl     $0x00,0x28E0(%esi)
         jmp      2f
       1:orl      $0x00040000,0x28E0(%esi)
       2:pushl    %edi
         call     _WinDD
       3:addl     $0x04,%esp
         jmp      L16

     L01:subl     $0x1240,%ecx
         js     L15
         cmpl     $0x1E,%ecx
         ja     L15
         movl     0x28E0(%esi),%eax
         jmp    *jt0(, %ecx, 4)
The jump table was moved to the .data segment. To keep an overview, your symbols for jump tables generally should be marked with special names. ST-Open uses the symbol 'jtX', where X is the number of the current jump table.
L02:andl     $0xFFFFE1FF,%eax
         orl      $0x1000,%eax
         jmp      0f
     L03:andl     $0xFFFFE1FF,%eax
         orl      $0x0800,%eax
         jmp      0f
     L04:andl     $0xFFFFE1FF,%eax
         orl      $0x0400,%eax
         jmp      0f
     L05:andl     $0xFFFFE1FF,%eax
         orl      $0x0200,%eax
         jmp      0f
     L06:andl     $0xFFFFFE7F,%eax
         orl      $0x0100,%eax
         jmp      0f
     L07:andl     $0xFFFFFE7F,%eax
         orl      $0x80,%eax
         jmp      0f
     L08:andl     $0xFFFFFF80,%eax
         orl      $0x40,%eax
         jmp      0f
     L09:andl     $0xFFFFFE80,%eax
         orl      $0x20,%eax
         jmp      0f
     L10:andl     $0xFFFFFE80,%eax
         orl      $0x10,%eax
         jmp      0f
     L11:andl     $0xFFFFFE80,%eax
         orl      $0x08,%eax
         jmp      0f
     L12:andl     $0xFFFFFE80,%eax
         orl      $0x04,%eax
         jmp      0f
     L13:andl     $0xFFFFFE80,%eax
         orl      $0x02,%eax
         jmp      0f
     L14:andl     $0xFFFFFE80,%eax
         orl      $0x01,%eax
       0:movl     %eax,0x28E0(%esi)
         jmp      L16
This part surely could be reduced further if I could remember what all those flags are good for...
     L15:popl     %ebx
         popl     %edi
         popl     %esi
         popl     %ebp
         jmp      _DefDP

     L16:xorl     %eax, %eax
         popl     %ebx
         popl     %edi
         popl     %esi
         popl     %ebp
         ret
'Exits' belong at the bottom of a function. First, the processor does not have to jump back and forth between random locations within the instruction chain. Secondly, human senses perceive structured (sorted) input much faster than random patterns spread all over the screen.
.comm    _GVAR, 4
One (used) out of all (unused). The size of the global variable was reset to the proper size of 32 bit, so four variables - rather than one - will fit into one paragraph (16 byte), again.


Analysis

The source code generated by GCC 3.3.5 perfectly reveals the greatest weaknesses of the C conventions. The first step to slow down C code markedly is the abuse of EBP as base pointer, reducing our set of available registers to six: EAX, EBX, ECX, EDX, EDI and ESI. One of these remaining registers, EAX, is used to pass results or error codes from a called function to the calling function, reducing our register set to five. By default, ECX and EDX are neither saved nor restored, so every time we call another function their content is sent to the great formatter - in other words: if we stored frequently used parameters in ECX or EDX prior to the call, they probably will be overwritten by the called function. Due to counterproductive conventions, only three registers are left to store our frequently used parameters.

If four or more parameters are required throughout a function, parameters 4 and up must be reloaded whenever we call another function, because the content of EAX, ECX and EDX is changed by the called function. Reloading parameters from the slower memory subsystem instead of keeping them preloaded in registers is quite time consuming. It is not about those five additional clock cycles we waste with reloading a parameter over and over again - the most negative side-effect is the immediate interruption of parallel execution, forcing two execution pipes to idle until the required parameter is available again. It is one of the reasons why C and C++ applications are that slow. An exemplary sequence is the following code snippet taken from GCC's output. It demonstrates how strict implementation of the C conventions slows down the execution of standard C and C++ code:
...
         movl     _GVAR, %eax
         movl     7252(%eax), %eax
         pushl    %eax
         ...
This sequence turns any parallel execution off. Instead of executing up to three instructions in the three available execution pipes, the code listed above is executed sequentially - two execution pipes have to wait until the previous instruction is executed and its result is available. If all data are present in the L1 cache, it takes (approximately) nine clock cycles to execute the three lines listed above. Two execution pipes are idling while _GVAR (_GVAR is BNR defined as an array of dwords to satisfy GCC) is loaded into EAX. Next, two pipes are idling while 1C54[BNR] is loaded into EAX. Finally, EAX is pushed onto the stack, blocking ESP for two clock cycles. If some data were not loaded into the L1 cache yet, execution time is extended markedly: reading data from the L2 cache costs about six clock cycles, loads from main memory consume about 27 clock cycles.

In the end, our primary intention - saving 14 clock cycles by not pushing and popping ECX and EDX in every function's prologue and epilogue - turns out to be a bad idea. In the best case, the assumed advantage is eaten up after the second reload of a frequently used parameter. In general, there is no advantage at all - the saved 14 clock cycles are wasted with the first reload from main memory. This is the main reason why conventional software is creeping through the execution pipes of John Doe's hypermodern Gigantium processor like thick dough. Perhaps it would be a good idea to purify the sap called software, giving John Doe the chance to at least partially enjoy the overwhelming computational powers of his expensive Gigantium machine?

Another obstacle to unleashing the powers of recent processors is the flying stack pointer ESP. Modern software design should use the capabilities of modern processors instead of blocking them with outdated mechanisms and ancient techniques. A pointlessly floating stack pointer with random content cannot be used seriously to address stack elements.
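At C level, the cure for such a dependency chain is simply to hoist the load of the global base into a local variable and reuse it. The struct layout and field names below are invented for illustration; only the offset 7252 (0x1C54) is taken from the listing above:

```c
#include <assert.h>

/* Invented stand-in for the global state behind _GVAR; the field at
 * offset 7252 (0x1C54) mimics the value GCC reloads again and again. */
struct gvar { int pad[1813]; int handle; };

static struct gvar storage = { .handle = 42 };
static struct gvar *GVAR = &storage;

/* Dependency chain: GVAR is re-read from memory before every access. */
static int sum_reloading(void)
{
    return GVAR->handle + GVAR->handle;
}

/* Hoisted: one load of the base into a local (ideally a register),
 * then the following accesses no longer depend on each other. */
static int sum_hoisted(void)
{
    struct gvar *g = GVAR;      /* movl _GVAR,%esi - done once */
    return g->handle + g->handle;
}
```

Both functions return the same value; the difference is purely in how many dependent loads the processor has to serialise.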


Considerations

As discussed in great detail, any conventional stack frame generally looks like this:





Programmers and compilers are free to write any kind of data to stack locations between the stack's bottom and the stack element below ESP (-0x04[ESP]). To reserve a part of the stack for the private use of the current function, we have to subtract the required size from ESP. Moving ESP towards the stack bottom, we reserve the corresponding area for our private use. No function (except those compiled with GCC 3.3.5) ever writes data to stack locations above -04[ESP]. It is very important to understand the connection between the subtraction of the required stack frame size from ESP and the exclusive ownership of the reserved area. My alternative method introduced in this paper, as well as conventional programming techniques, are based on this principle. You just break an absolute taboo if you write to stack locations above -04[ESP]!

Because conventional programming methods push data onto the stack, respectively pop them from the stack, the content of ESP changes continuously. Due to its randomly changing content, it is impossible to use ESP as a base for addressing specific stack elements directly. An additional register, the base pointer EBP, is required for this task. It points to the current top of the stack all of the time, so we can safely address all stack elements regardless of the current content of ESP.

If we analyse the described process and have a closer look at its details, one thing is quite obvious: we could save all these time consuming contortions with an additional register if we replaced the usual push - call - add $n,%esp sequences with something else. A short look into the data sheets of modern processors reveals some additional caveats. Let us recapitulate what we know about push and pop instructions and compare them against a bunch of simple mov instructions:

PUSH




Depending on the standard data size, either two, four or eight is subtracted from the stack pointer rSP, setting it to the next lower stack element. After updating rSP, the content of a register, a memory location or an immediate value is copied to the current stack element (00[rSP]). A push reg or push imm instruction is executed within three clock cycles. rSP is blocked for the time it needs to update its content and deliver the stack location where the given data shall be stored. This requires two clock cycles, so consecutive push instructions have a latency of at least two clock cycles. As an alternative, we might replace one push with three mov instructions. If none of them depends on the result of one of the other two movs, all of them are executed simultaneously within three clock cycles, as well. If we write to contiguous memory locations, all writes are done in one gulp, because a mechanism called write combining is triggered.

POP




The content of the current stack element is copied to the register or memory location specified by the pop instruction, then two, four or eight is added to the stack pointer rSP. The current stack element was 'taken from the stack', and the stack pointer now points to the stack element above (our new stack bottom). Unlike the direct path push, pop is a vector path instruction. These instructions are always fed to execution pipes 0 and 1, blocking execution pipe 2 while they are executed. The latency of each pop instruction is four clock cycles. As an alternative, we might replace one pop with four mov instructions. Three of them are executed simultaneously within three clock cycles, the fourth mov starts execution in the fourth clock cycle.

Conclusions

As the analysis of both instructions shows, the only rational consequence is to replace all pop and push instructions with mov instructions. Doing so not only speeds up the passing of parameters, it also turns the flying stack pointer into a static one. As a positive side-effect of the frozen stack pointer, we no longer need EBP as base pointer, freeing one valuable register for general purposes. Using no base pointer, we also get rid of that mess with positive and negative offsets relative to EBP - a source of errors, making source code less readable.

Addendum

With the Athlon64 family 10h (aka Phenom), the latency of pop instructions was reduced to three clock cycles (direct path single). Nonetheless, replacing push and pop with mov instructions is a much better improvement, because of the positive side-effects coming along with the replacement automatically. Even if Intelligent Design functions win any race around clock cycles with ease, the concept offers much more improvement than just counting clocks.


Intelligent Design

If we put all discussed aspects together, we can state that conventional programming techniques completely ignore the improvements introduced with modern processor design. The conclusions of our analysis suggested replacing all leave, pop and push instructions with mov instructions, because three of them can be executed simultaneously. The advantages of these replacements are a static stack pointer, an additional general purpose register (EBP), a naturally aligned stack (if supported by the operating system), read and write access to any stack element via ESP, and much more. Based on the facts we have worked out so far, a concept was developed and tested for usability in everyday applications. Because we no longer use a base pointer, the stack is addressed via ESP, only. As shown before, we can reserve a stack area for our private use by subtracting the required size from the stack pointer ESP. The reserved area is safe from being used by called functions (except those compiled by GCC 3.3.5). Let us recapitulate what the stack looks like after a function is called:





The current stack element holds the return address - the address of the instruction following the call in the calling function. If the calling function has to pass parameters to our function, they are stored above the current stack element in ascending order. Because we want to use mov instructions only, a sufficient stack area must be reserved where we can save used registers, store local variables and pass parameters to called functions. Working with a static stack pointer, we just have to add the required sizes of these three parts (registers, local area and parameters), then subtract the sum from ESP:





After this subtraction, the content of ESP does not change until our function is terminated. Therefore, ESP can be used to address any element within our current stack frame. Elements outside our private area are taboo - you may read them whenever you want, but you must not write to them under any circumstances!

Calculating The Required Size

The size of our stack frame depends on the number of registers we have to save, the size of our local variables and the number of parameters we have to pass.

Registers

The contents of all used registers are stored at the top of our private area. All Intelligent Design functions do save ECX and EDX, as well! The following table shows the required size for 16, 32 and 64 bit functions:


Registers   16 Bit   32 Bit   64 Bit
     1       0x02     0x04     0x08
     2       0x04     0x08     0x10
     3       0x06     0x0C     0x18
     4       0x08     0x10     0x20
     5       0x0A     0x14     0x28
     6       0x0C     0x18     0x30
     7       0x0E     0x1C     0x38
     8        -        -       0x40
     9        -        -       0x48
    10        -        -       0x50
    11        -        -       0x58
    12        -        -       0x60
    13        -        -       0x68
    14        -        -       0x70
    15        -        -       0x78


If no registers are used, the required size is zero, of course. Rows 8 through 15 are not available in 16 and 32 bit mode, because these modes do not recognise the extended 64 bit register set. Occasionally, you might need to save some MMX or XMM registers: just add 8 for each MMX and 16 for each XMM register. It is recommended to avoid using MMX registers, because they are internally mirrored onto the original FPU registers ST(0) through ST(7). Older software libraries and operating systems do not know anything about MMX registers, so you might encounter weird problems if you do not issue an explicit emms or femms before you call external functions. Because emms and femms generally destroy the contents of all MMX registers, it is a good idea to use XMM instead of MMX registers.

Local Variables

Between the saved registers at the top of our stack frame and the area where parameters for called functions are stored, we have to reserve the space required for local data (variables, structures, strings). If the size is not a multiple of the standard data size, it is recommended to expand it to the next higher multiple of the standard data size. It is very important to keep the stack aligned, because odd addresses in ESP trigger penalty cycles for every unaligned memory access. If you do not take care and add a different size in your epilogue than you subtracted in your prologue, the program definitely will crash, because EIP probably is set to a faulty return address. If you want to store strings on the stack, it is recommended to define a sufficient size, so the function reading characters from the input device has no chance to overwrite parameters or register contents stored in the stack elements above.

Parameters

Parameters are stored at the bottom of our stack frame. They are moved to the stack elements in ascending order, starting at 00[ESP]. Because we now mov parameters to the corresponding stack elements rather than pushing them, we should reserve an area equal to the size required by the function with the largest number of parameters. If we have three functions expecting three parameters and one function expecting five parameters, we need a size of 20 byte (32 bit code) to pass five parameters - the three parameters for the other three functions automatically fit into the reserved 20 byte. The required size can be taken from the register table above. If more parameters must be passed than the table covers, the size can easily be calculated with the formula parameters * datasize, where datasize is 2 (16 bit), 4 (32 bit) or 8 (64 bit). For example, the required size for the 13 parameters we have to pass to WinCreateWindow() is 13 * 4 = 52 byte. The bottom of the corresponding stack frame looks like this:
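The sizing rule can be written down as a few lines of C. The helper name and its interface are my own invention, but the arithmetic is exactly the rule stated above:

```c
#include <assert.h>
#include <stddef.h>

/* Size of the parameter area: the worst case parameter count over all
 * called functions, times the standard data size (2, 4 or 8). */
static size_t param_area(const int *counts, size_t n, size_t datasize)
{
    size_t max = 0;
    for (size_t i = 0; i < n; i++)
        if ((size_t)counts[i] > max)
            max = (size_t)counts[i];
    return max * datasize;
}
```

With the example from the text - three callees taking three parameters and one taking five, 32 bit code - this returns the 20 byte mentioned above.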

[Figure 08: the parameter area at the bottom of the stack frame]
Watch out: parameter 1 in figure 08 is identical to the leftmost parameter we write within the parentheses of a C or C++ function call. In general, parameters are moved onto the stack in ascending order: parameter 1 @ 00[ESP], parameter 2 @ 04[ESP] and so on, until the last parameter is moved. If a parameter does not change between two or more consecutive call instructions, it only has to be moved the very first time it is used. For the following calls, the parameter already is stored at the right place, so we do not need to store it at the same location again - it stays there until we overwrite it!

Adding Sizes

After we have determined the required sizes of all three parts (registers, local area, parameters to pass), we calculate their sum, then round the result up to a size ending with 3C, 7C, BC or FC (32 bit mode), respectively 38, 78, B8 or F8 (64 bit mode). Applying this trick, our stack frame automatically is aligned to the beginning of the next cache line. Finally, we create our new stack frame by subtracting the rounded up sum from rSP. As described in detail, this subtraction protects the created stack frame from being overwritten by other functions. Writing data to stack locations outside the current stack frame violates the rules of conventional programming as well as the rulework defined for Intelligent Design.

Setting the first stack element to the beginning of a cache line, we benefit from a bunch of advantages. First, no time consuming workarounds are required to align stack locations to store or load XMM registers - it is just a side effect of the Intelligent Design prologue that rSP is naturally aligned by default without a single line of additional code. Secondly, Intelligent Design code is able to benefit from the bandwidth and accelerating mechanisms provided by modern processors. At the latest, the first access to any stack element loads the stack frame into the L1 cache, moving all following accesses to the fastest area in the machine's memory hierarchy. Thirdly, writing data to contiguous memory locations in ascending order triggers a mechanism called write combining, where up to 16(!) doublewords are stored on the stack in one gulp. Many more advantages, like avoiding repetitive stores for never changed parameters, come along with this fresh and modern design. Some of them are shown later on, some of them are not revealed, yet... If all Intelligent Design rules are applied properly, the smallest possible stack frame automatically looks like this in 32 bit code
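For reference, the rounding step can be expressed as a small C helper - a sketch under the assumption that the return address slot (4 or 8 byte) has to complete the cache line:

```c
#include <assert.h>

/* Round raw up to the next size s >= raw with s % 64 == 64 - datasize,
 * so s ends with 0x3C/0x7C/0xBC/0xFC (datasize 4)
 * or 0x38/0x78/0xB8/0xF8 (datasize 8). */
static unsigned frame_size(unsigned raw, unsigned datasize)
{
    unsigned target = 64 - datasize;     /* 0x3C or 0x38 */
    return raw + (target + 64 - raw % 64) % 64;
}
```

Subtracting the returned value from an rSP that ends with 3C/38 (as guaranteed after a call on an aligned stack) lands the frame exactly at a cache line boundary.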





or like this in 64 bit code





The real size can be varied by adding multiples of 64 to the smallest value of 0x3C (0x38 in 64 bit code) to match your individual requirements perfectly. Due to the subtraction of a predefined size from rSP, the stack frames of all Intelligent Design functions automatically start at the beginning of a cache line. In general, functions are called. Looking at the mechanism of the call instruction, the return address is pushed onto the stack before execution continues with the first instruction of the called function. Hence, rSP points to an address ending with 3C, 7C, BC or FC in 32 bit code, respectively 38, 78, B8 or F8 in 64 bit code - exactly the size we have to subtract from rSP.

Destroying The Stack Frame

After our function's code has been executed, all registers must be restored to the content they had when our function was entered. The last step before returning to the calling function is to add the size we subtracted to create our stack frame back to rSP. With the final ret, rIP is taken from the stack and execution continues with the instruction following the call of our function. At this point, our stack looks like this, again:




Because the area between the stack bottom and the address currently loaded into rSP is 'no man's land', the next function is free to reserve a part of it to save used registers, to store its local variables and structures, or to put some parameters for called functions onto the stack. In principle, the content of a stack element is undefined until something has been written to it. Writing data into untouched stack elements is called 'initialisation' in C and C++ terminology.

Side Notes About Alignment

Using XMM instructions, most 128 bit stores and loads explicitly expect addresses aligned to 16 byte boundaries, in other words: the last digit of the address must be zero. No existing operating system cares about aligned addresses at all. Using excessive push orgies to put the parameters for called functions onto the stack, the chance to enter a function with an aligned stack pointer is about 25 percent. Because 16 / 4 = 4, the chance that ESP ends with 04, 08 or 0C is about 75 percent. It is left to the programmer or compiler to handle this problem properly, so you have to add redundant, time consuming code to align the stack pointer in every function working with XMM instructions.
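The 25 percent figure follows from simple counting: with 4 byte pushes, ESP can end up in four residue classes modulo 16, and only one of them satisfies the alignment XMM stores expect. A trivial C sketch of this count:

```c
#include <assert.h>

/* Count how many of the four possible ESP residues modulo 16
 * (0, 4, 8, 12 - reachable with 4 byte pushes) are 16 byte aligned. */
static int aligned16_classes(void)
{
    int aligned = 0;
    for (int esp = 0; esp < 16; esp += 4)
        if (esp % 16 == 0)
            aligned++;
    return aligned;    /* one class out of four: 25 percent */
}
```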

As shown, Intelligent Design provides mechanisms to naturally align all stack frames to the beginning of a cache line, but even the best mechanism is of no use if the operating system does not support it. In most cases, an ID compliant function is called by the message transporting system within the operating system. Whenever a message is sent or posted to our application's message loop, we do not know which instance of which function of the operating system called our message loop. Hence, we neither know how many things were pushed onto our stack, nor can we rely on something like an aligned stack pointer. Even if we aligned the stack pointer at the beginning of our main() function, it surely is not aligned any longer when the code in our message loop is executed the very first time. Therefore, one of the most important improvements introduced with Intelligent Design is not available as long as we run our code on any existing platform. Because all existing operating systems put dirt into our gears, no machine will ever run at full speed.

No application programmer is able to do anything against this counterproductive behaviour of existing operating systems. Either we bow our heads, obey and apply all those pointless work-arounds recommended by processor manufacturers, or we give up programming to get rid of them. Because work-arounds are not very practicable in many cases, there is a third way as an alternative to this 'take it or leave it' choice: instead of searching for ways to bypass the built-in restrictions, speed brakes and other obstacles of existing operating systems, it might be a much better idea to throw away these remains of old fashioned software design and create something new. Something supporting the innovations and capabilities of modern processors instead of ignoring them. Something making use of the immense computational power and speed of recent quad core machines. Something called IDEOS (the Intelligent Design Easy2Use Operating System)...


Summary

I hope I could convince you of the various advantages introduced with the development of Intelligent Design. Compared against conventional programming techniques (as discussed in this document), Intelligent Design wins in all disciplines - be it sheer speed, high code density, simplified development or efficiency. Through the strict renunciation of slow leave, pop and push instructions, which come along with many ill side-effects like an unpredictable stack pointer, these drawbacks of conventional software design are easily turned into some really welcome side-effects by replacing those slow instructions with simple moves.

The most welcome side-effect surely is the fact that we get back EBP, formerly abused as base pointer, as a general purpose register. One additional register is quite a lot if only six of them are available. Each additional register is an advantage, because we can pair reads to copy parameters from memory to registers and access these registers rather than reloading single registers with some frequently used parameters over and over again.

Another advantage is that we mov parameters to the right location rather than pushing them onto the stack. While mov instructions can store changed parameters anywhere, a push always addresses the next lower stack element (a quite limited range). If we call a function repeatedly and only the last parameter changes from call to call, we would have to push all parameters before each call just to update this topmost parameter with the current data. Moreover, the correction of ESP is mandatory after each call, blocking the following push for an entire clock cycle. Using mov instructions, all these problems vanish. We save many unnecessary instructions and keep the three execution pipes busy most of the time. Of course, there are a lot of cases where only one or two instructions are required between two calls, but this still is faster than a conventional push orgy.

Yet another advantage is the fact that we can move parameters and other data onto the stack anywhere within our source file. Pairing multiple reads in groups, we can avoid direct dependencies completely. Pairing multiple writes, we trigger write combining. Keeping proper distances between reads and writes can eliminate most dependencies, with few exceptions. Applying these tricks as often as possible speeds up code markedly.

Putting it all together, Intelligent Design is the up-to-date alternative to some old fashioned conventional programming techniques. Conventional software design did not change too much in the last thirty years. We can neither create the most simple integrated circuit with stone age tools, nor are we able to create software making full use of the capabilities provided by modern processors with conventional code. Old fashioned code is spiced with mandatory brake pads, and compilers generate tons of counterproductive dependencies, because the convention allows the abuse of our most valuable resources - the registers - as a garbage pile to save four instructions at the cost of multiple reloads. Intelligent Design replaces these obstacles with straight and clean rules, providing full support for the features and improvements of recent processors.

Intelligent Design is an up-to-date concept for the next generation of operating systems and applications, introducing a new quality to software design. Moreover, it is much easier to handle a single stack pointer than to manage two registers, a separate stack and base pointer, simultaneously. Getting rid of the institution called base pointer, we finally can say good bye to positive and negative offsets and error prone corrections of the stack pointer after each of the obligatory push orgies. Another positive side-effect is the return of a valuable resource (EBP) to the very sparse pool of general purpose registers - the worst conceptual flaw of LETNi's x86 architecture. All in all, Intelligent Design points the way ahead to the fastest code with high density, causing fewer headaches for stressed programmers. Best of all: this is not just a theoretical construction floating around in the head of a weird, unworldly developer - it exists for real! ST-Open's libraries as well as DatTools and the SRE editor were ported to Intelligent Design, working as expected: really fast and reliable.


Rules

Whenever we want to build a greater whole out of many independent parts created by individual contributors, we have to use a common set of rules. Anyone contributing a part to the whole has to obey these rules. If every contributor applied his own rules, we would end up with a lot of parts, but none of them would be able to work together with any other part. Using different interfaces, the single parts either could not communicate at all or would misunderstand received messages completely. Therefore, we need a set of rules before we start to create parts of a greater whole. This set of rules, like a common language, defines a common interface, forcing every contributor to use standardised communication protocols, so other parts are able to understand all sent messages and do the right work. The big idea behind any set of rules lives or dies with strict obedience.

Rule 1

The instructions enter, leave, push and pop are archaic remains of a past era. They must not be used in Intelligent Design code.

Any of these instructions changes the content of the stack pointer automatically. All Intelligent Design functions depend on one principle: the stack pointer must not change its content for their entire runtime. We either can enjoy the advantages of a static stack pointer, or we can waste precious time with obligatory corrections and repetitive writes of one and the same parameter. Using a 'flying' stack pointer and its urgently required companion called base pointer is mutually exclusive with Intelligent Design. Both concepts use the stack pointer in very different ways - there is no chance to mix both concepts in one and the same function.

Rule 2

Stack frames are created by subtracting 0x3C in 32 bit functions or 0x38 in 64 bit functions from the stack pointer ESP/RSP. If larger stack frames are required, the basic value 0x3C or 0x38 can be expanded in 64 byte steps (0x7C, 0xBC, 0xFC, ... in 32 bit functions, 0x78, 0xB8, 0xF8, ... in 64 bit functions).

This rule is valid for IDEOS, only. No other operating system will benefit from the advantages of a properly aligned stack. Because the stack generally is not aligned in old fashioned operating systems, you have to align it with some lines of redundant bloat if you use XMM registers. If you write IDEOS code, you have to apply this rule. The only exception is the rare case that you use no registers other than rAX and XMM0 through XMM3. If there are registers to preserve, you need a stack frame. The advantage of this rule is a properly aligned stack pointer, starting at the beginning of a cache line. IDEOS's concept guarantees that the stack pointer is naturally aligned to a multiple of 64 before any function is called. Because the call instruction stores EIP/RIP in the next stack element, subtracting the correct value automatically aligns the stack pointer to a multiple of 64, and the corresponding cache line is preloaded into the L1 cache without a single line of additional code.

Rule 3

All registers used in a function must be saved before they are overwritten and must be restored before returning to the caller.

Any software running on modern processors is most efficient if it can access as many hardware resources as possible. The most valuable resources of an x86 processor are its registers. The more frequently used parameters are held in registers, the less time is required to perform a given task. If we reload parameters with slow memory reads rather than keeping them in registers, we slow down our code with superfluous operations and interrupt simultaneous execution completely. Preloading all required parameters into registers in one gulp reduces dependencies, supporting parallel execution and keeping all three pipes busy all of the time. The precious clock cycles saved with much faster register operations outweigh by far the few clock cycles needed to save and restore all used registers. Many functions work with a subset of all available registers - the fewer registers we use, the faster they are saved and restored. Another welcome side-effect is based on the fact that continuous writes to ascending addresses trigger write combining, where up to 16 doublewords are written to a cache line in one gulp. This surely is much faster than 16 single writes. Moreover, pairing multiple reads or writes is a good opportunity to keep simultaneous execution flowing as long as possible.

Rule 4

The registers RAX, XMM0, XMM1, XMM2 and XMM3 are reserved for special purposes - they neither are saved nor restored by default.

This rule partially overrides rule 3: by definition, register RAX is used to pass return values or error codes to the calling function. It does not make sense to save a register that is overwritten in all epilogues by default. If a function declaration defines its return value as VOID, rAX must be cleared with a simple xor %rAX,%rAX on exit.

Registers XMM0 through XMM3, with their total size of 4 * 128 bit = 64 byte, cover an entire cache line. They are used to clear, move or manipulate large data areas in steps of 64 byte, preferably aligned to a multiple of 64. In general, these operations are executed in small loops where no other functions are called, so we can skip saving and restoring these registers per se. This saves a lot of clock cycles if your application makes heavy use of such operations. If your function calls others while manipulating large memory blocks, the called function might use XMM0 through XMM3 as temporary storage, overwriting whatever was stored in these registers. Hence, it is recommended to use XMM4 through XMM15 if preloaded values must not change while other functions are called. As stipulated by rule 3, you have to save and restore XMM4 through XMM15 if you use them.
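What the four XMM registers achieve in assembly - wiping a whole cache line per loop iteration - can be approximated in portable C with eight 64 bit stores per step. This is an illustrative sketch, not ST-Open library code; it assumes the length is a multiple of 64 and the destination is suitably aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* Clear len bytes (len must be a multiple of 64), one cache line per
 * iteration - eight 64 bit stores approximate four 128 bit XMM stores. */
static void clear64(void *dst, size_t len)
{
    uint64_t *p = dst;
    for (size_t i = 0; i < len / 8; i += 8) {
        p[i]     = 0; p[i + 1] = 0; p[i + 2] = 0; p[i + 3] = 0;
        p[i + 4] = 0; p[i + 5] = 0; p[i + 6] = 0; p[i + 7] = 0;
    }
}
```

The grouped stores to ascending, contiguous addresses are exactly the pattern that lets write combining drain a whole line in one gulp.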

Rule 5

Accessing registers is much faster than accessing memory locations. The most frequently used parameters in a function should be preloaded into registers as soon as these registers have been saved on the stack.

Even if old-fashioned programmers aren't used to working with multiple execution pipes: Obeying this rule can turn an ox cart into a rocket (see Rule 3).
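Rule 5 is what the compiler output below gets wrong: the base address _GVAR is reloaded from memory before nearly every access. In C the same idea looks like this (a sketch with made-up names - GVAR and struct state are hypothetical; in hand-written assembly the preload is the programmer's job, while an optimising compiler may or may not do it for you):

```c
#include <assert.h>

/* Hypothetical global state block, standing in for _GVAR below. */
struct state { int a; int b; int sum; };
static struct state  g    = { 3, 4, 0 };
static struct state *GVAR = &g;

/* Reloads the base pointer from memory for every member access. */
static void reload_each_time(void)
{
    GVAR->sum = GVAR->a + GVAR->b;
}

/* Rule 5: fetch the base pointer once into a local (i.e. a register),
 * then work with pure register-relative addressing. */
static void preload_once(void)
{
    struct state *s = GVAR;    /* one load, done */
    s->sum = s->a + s->b;
}
```

Both functions compute the same result; the difference only becomes visible in the generated code, where the second variant issues a single load of GVAR instead of three.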

Rule 6

Modern processors execute three (LETNi: 1.5) instructions simultaneously. Good code should make use of these capabilities rather than ignore them. Dependency chains interrupt simultaneous execution: while one instruction is executed in one pipe, the other two pipes are waiting for the result. Sending two of three pipes to sleep for some clocks wastes two thirds of the available resources.

This extraordinary product of a classical compiler
...
    pushl    %esi
    pushl    $0
    pushl    $2
    pushl    $0
    pushl    $11
    movl     _GVAR, %eax
    movl     7252(%eax), %eax
    pushl    %eax
    call     _FDacc
    movl     _GVAR, %eax
    addl     $24, %esp
    movl     472(%eax), %ebx
    pushl    %ebx
    pushl    $0
    pushl    $2
    pushl    $4
    pushl    $11
    movl     7252(%eax), %ecx
    pushl    %ecx
    call     _FDacc
    addl     $20, %esp
    movl     _GVAR, %eax
    movl     $0, 10464(%eax)
    ...
can be improved with a few useful changes
...
    movl _GVAR,%esi    # at least 3 clocks ahead
    ...
    ...
    ...
    movl 0x1C54(%esi),%eax
    movl 0x01D8(%esi),%ecx
    movl $0x00,0x28E0(%esi)
    movl %eax,0x00(%esp)
    movl $0x0B,0x04(%esp)
    movl $0x00,0x08(%esp)
    movl $0x02,0x0C(%esp)
    movl $0x00,0x10(%esp)
    movl %edi,0x14(%esp)
    call _FDacc
    movl $0x04,0x08(%esp)
    movl %ecx,0x14(%esp)
    call _FDacc
    ...
to run at least twice as fast now. Removing all superfluous instructions reduced the code size from 23 to 14 instructions. Placing all instructions in the proper order removes dependencies and keeps all pipes busy without delays.

Rule 7

Jump tables belong to the data segment. AS is able to build jump tables in the data segment, so there's absolutely no reason to pollute the code segment with data of any kind.

No comment is required for this rule. Just do it!
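The C analogue of such a jump table is an array of function pointers, which the compiler places in the data segment (.data or .rodata), never in .text. The handler names here are made up for the example:

```c
#include <assert.h>

typedef int (*handler)(int);

/* Three hypothetical handlers selected by an opcode. */
static int op_inc(int x) { return x + 1; }
static int op_dec(int x) { return x - 1; }
static int op_neg(int x) { return -x;    }

/* The jump table itself: data, not code. Being const, it typically
 * ends up in .rodata - well away from the code segment. */
static const handler table[] = { op_inc, op_dec, op_neg };

static int dispatch(unsigned op, int x)
{
    if (op >= sizeof table / sizeof table[0])
        return 0;              /* out-of-range opcodes fall through */
    return table[op](x);       /* one indirect call, no branch chain */
}
```

An assembler achieves the same by emitting the table of label addresses into the data segment and performing an indirect jump through it, instead of interleaving address words with instructions in the code segment.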


Appendices

AT&T-Syntax
Appendix 1 - Examples
Appendix 2 - Optimisations