800 464 7928 800 swb isdn 52.85 54.70 250.00 *125* 52.85 DirectX - using Blt() to do clears is much much faster than memset. On many cards the clear can happen async and doesn't take much time. Even if you call lock() immediately after Blt() it's much faster. - Blt's to word alligned boundaries are much faster (~1.5x faster) - limit surface locks() & unlocks(), I went form 204fps to 232fps by limiting locks to 1 lock per frame - Implementing a software z-buffer can be accelerated by DirectX by creating the Z-buffer as a directx Surface and using Blt to clear the surface. This is a little tricky but you can get your clears for much less time. If you are using a floating point z-buffer, I recommend creating a 16bit surface that is (width * 2, height), because creating a 32bit surface may not possible clear the high-byte (considered alpha). It is safest to clear the Z buffer to 0 so that the card does not do any transformations on the color value. This means you have to negate your z-values and reverse you z-compare function. Threaded Applications - for threaded applications, don't use EnterCriticalSection & LeaveCriticalSection section these take about 60cycles to execute (together). Instead use intel's "bts" instruction which will will cost 6cycles inlined, and 12 cycles if called (Does anyone know if you need to prefix bts with a "lock" for a multi-processor environment?) ********* example replacement class ***** class critical_section_lock { public: int flag; tlock() { flag=0; } void __fastcall lock(); void __fastcall unlock(); }; void __fastcall critical_section_lock::unlock() { __asm mov [ecx], 0 } void __fastcall critical_section_lock::lock() { __asm { start: bts [ecx], 0 jnc success call thread_yield // give up our time-slice jmp start success: } } General Suggestions - converting a float to an int using C-casting under Visual C is very slow because Visual C does a "safe" conversion which involves calling a function changes the FPU registers to set the currect rounding mode and then does a fistp - the meat of the operation, restores the FPU and return. Most of the time the FPU will already be in the correct state so using a simple inlined assembly function like this will save a lot of time: inline int long ftoi(float f) { int res; __asm { fld f fistp res } return res; } - when timing functions under MSDEV turn off incremental linking. incremental linking often adds an extra "jmp" for every "call" which is used to patch together your code without having to re-layout everything. This adds 2-3 extra clocks to every function you call. - for small functions use the __fastcall function declaration and for C++, use [ecx] instead of "this" using "this" will cause the compiler to generate "push ebp, mov ebp, esp" pairs which may not be needed for simple get/set type functions Profiling & tuning - start with small and work your way up.