
Game Programming Gems II



1.1
Optimization for C++ Games


Andrew Kirmse, LucasArts Entertainment
[email protected]

Well-written C++ games are often more maintainable and reusable than their plain C counterparts are, but is it worth it? Can complex C++ programs hope to match traditional C programs in speed? With a good compiler and thorough knowledge of the language, it is indeed possible to create efficient games in C++. This gem describes techniques you can use to speed up games in particular. It assumes that you're already convinced of the benefits of using C++, and that you're familiar with the general principles of optimization (see Further Investigations for these).

One general principle that merits repeating is the absolute importance of profiling. In the absence of profiling, programmers tend to make two types of mistakes. First, they optimize the wrong code. The great majority of a program is not performance critical, so any time spent speeding it up is wasted. Intuition about which code is performance critical is untrustworthy; only by direct measurement can you be sure. Second, programmers sometimes make "optimizations" that actually slow down the code. This is particularly a problem in C++, where a deceptively simple line can actually generate a significant amount of machine code. Examine your compiler's output, and profile often.

Object Construction and Destruction


The creation and destruction of objects is a central concept in C++, and is the main area where the compiler generates code "behind your back." Poorly designed programs can spend substantial time calling constructors, copying objects, and generating costly temporary objects. Fortunately, common sense and a few simple rules can make object-heavy code run within a hair's breadth of the speed of C.

Delay construction of objects until they're needed. The fastest code is that which never runs; why create an object if you're not going to use it? Thus, in the following code:

void Function(int arg)
{
    Object obj;
    if (arg == 0)
        return;
    // ...
}

even when arg is zero, we pay the cost of calling Object's constructor and destructor. If arg is often zero, and especially if Object itself allocates memory, this waste can add up in a hurry. The solution, of course, is to move the declaration of obj until after the if check.

Be careful about declaring nontrivial objects in loops, however. If you delay construction of an object until it's needed in a loop, you'll pay for the construction and destruction of the object on every iteration. It's better to declare the object before the loop and pay these costs only once. If a function is called inside an inner loop, and the function creates an object on the stack, you could instead create the object outside the loop and pass it by reference to the function.

Use initializer lists. Consider the following class:

class Vehicle
{
public:
    Vehicle(const std::string &name) // Don't do this!
    {
        mName = name;
    }
private:
    std::string mName;
};

Because member variables are constructed before the body of the constructor is invoked, this code calls the constructor for the string mName, and then calls the = operator to copy in the object's name. What's particularly bad about this example is that the default constructor for string may well allocate memory; in fact, more memory than may be necessary to hold the actual name assigned to the variable in the constructor for Vehicle. The following code is much better, and avoids the call to operator=. Further, given more information (in this case, the actual string to be stored), the nondefault string constructor can often be more efficient, and the compiler may be able to optimize away the Vehicle constructor invocation when the body is empty:

class Vehicle
{
public:
    Vehicle(const std::string &name) : mName(name)
    {
    }
private:
    std::string mName;
};
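The difference between assigning in the constructor body and using an initializer list can be made visible with a small counting experiment. The sketch below is only an illustration: CountedName is a hypothetical stand-in for std::string that records how it is created and copied, and VehicleAssign and VehicleInit are illustrative names for the two versions of Vehicle discussed above.

```cpp
#include <cassert>

// Hypothetical stand-in for std::string that counts its special member calls.
struct CountedName
{
    static int sDefaultCtors;
    static int sCopyCtors;
    static int sAssignments;

    CountedName() { ++sDefaultCtors; }
    CountedName(const CountedName &) { ++sCopyCtors; }
    CountedName &operator=(const CountedName &) { ++sAssignments; return *this; }
};

int CountedName::sDefaultCtors = 0;
int CountedName::sCopyCtors = 0;
int CountedName::sAssignments = 0;

// Assignment in the constructor body: mName is default-constructed first,
// and then assigned.
class VehicleAssign
{
public:
    explicit VehicleAssign(const CountedName &name) { mName = name; }
private:
    CountedName mName;
};

// Initializer list: mName is copy-constructed directly, in one step.
class VehicleInit
{
public:
    explicit VehicleInit(const CountedName &name) : mName(name) {}
private:
    CountedName mName;
};
```

Constructing a VehicleAssign bumps both the default-constructor and assignment counters, while constructing a VehicleInit bumps only the copy-constructor counter. With a real string class, each of those extra steps may involve a memory allocation.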

Prefer preincrement to postincrement. The problem with writing x = y++ is that the increment function has to make a copy of the original value of y, increment y, and then return the original value. Thus, postincrement involves the construction of a temporary object, while preincrement doesn't. For integers, there's no additional overhead, but for user-defined types, this is wasteful. You should use preincrement whenever you have the option. You almost always have the option in for loop iterators.

Avoid operators that return by value. The canonical way to write vector addition in C++ is this:

Vector operator+(const Vector &v1, const Vector &v2)

This operator must return a new Vector object, and furthermore, it must return it by value. While this allows useful and readable expressions like v = v1 + v2, the cost of a temporary construction and a Vector copy is usually too much for something called as often as vector addition. It's sometimes possible to arrange code so that the compiler is able to optimize away the temporary object (this is known as the "return value optimization"), but in general, it's better to swallow your pride and write the slightly uglier, but usually faster:

void Vector::Add(const Vector &v1, const Vector &v2)

Note that operator+= doesn't suffer from the same problem, as it modifies its first argument in place, and doesn't need to return a temporary. Thus, you should use operators like += instead of + when possible.

Use lightweight constructors. Should the constructor for the Vector class in the previous example initialize its elements to zero? This may come in handy in a few spots in your code, but it forces every caller to pay the price of the initialization, whether they use it or not. In particular, temporary vectors and member variables will implicitly incur the extra cost. A good compiler may well optimize away some of the extra code, but why take the chance? As a general rule, you want an object's constructor to initialize each of its member variables, because uninitialized data can lead to subtle bugs. However, in small classes that are frequently instantiated, especially as temporaries, you should be prepared to compromise this rule for performance. Prime candidates in many games are the Vector and Matrix classes. These classes should provide methods (or alternate constructors) to set themselves to zero and the identity, respectively, but the default constructor should be empty.
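As a rough sketch of how these rules might combine in one class, the Vector below has an empty default constructor, an explicit Zero() method, an Add() method that writes its result in place, and an operator+= that modifies its left-hand side. The member names x, y, and z and the three-float layout are assumptions made for illustration; the text does not specify them.

```cpp
#include <cassert>

class Vector
{
public:
    // Deliberately empty: x, y, and z are left uninitialized so that
    // temporaries and members pay nothing for construction.
    Vector() {}

    Vector(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}

    // Callers that need a zero vector ask for it explicitly.
    void Zero() { x = y = z = 0.0f; }

    // Stores v1 + v2 in this object; no temporary Vector is created.
    void Add(const Vector &v1, const Vector &v2)
    {
        x = v1.x + v2.x;
        y = v1.y + v2.y;
        z = v1.z + v2.z;
    }

    // Modifies the left-hand side in place and returns a reference to it.
    Vector &operator+=(const Vector &v)
    {
        x += v.x;
        y += v.y;
        z += v.z;
        return *this;
    }

    float x, y, z;
};
```

A caller writes result.Add(a, b) where it would otherwise have written result = a + b. Because the default constructor does nothing, reading a default-constructed Vector before calling Zero(), Add(), or an assignment is a bug, which is the trade-off the text describes.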

Section 1 General Programming

As a corollary to this principle, you should provide additional constructors to classes where this will improve performance. If the Vehicle class in our second example were instead written like this:

class Vehicle
{
public:
    Vehicle()
    {
    }

    void SetName(const std::string &name)
    {
        mName = name;
    }
private:
    std::string mName;
};

we'd incur the cost of constructing mName, and then setting it again later via SetName(). Similarly, it's cheaper to use copy constructors than to construct an object and then call operator=. Prefer constructing an object this way:

Vehicle v1(v2);

to this way:

Vehicle v1;
v1 = v2;

If you want to prevent the compiler from automatically copying an object for you, declare a private copy constructor and operator= for the object's class, but don't implement either function. Any attempt to copy the object will then result in a compile-time error. Also get into the habit of declaring single-argument constructors as explicit, unless you mean to use them as type conversions. This prevents the compiler from generating hidden temporary objects when converting types.

Preallocate and cache objects. A game will typically have a few classes that it allocates and frees frequently, such as weapons or particles. In a C game, you'd typically allocate a big array up front and use its elements as necessary. With a little planning, you can do the same thing in C++. The idea is that instead of continually constructing and destructing objects, you request new ones and return old ones to a cache. The cache can be implemented as a template, so that it works for any class, provided that the class has a default constructor. Code for a sample cache class template is on the accompanying CD. You can either allocate objects to fill the cache as you need them, or preallocate all of the objects up front. If, in addition, you maintain a stack discipline on the objects (meaning that before you delete object X, you first delete all objects allocated after X), you can allocate the cache in a contiguous block of memory.
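The following is a minimal sketch of what such a cache template could look like. It is not the code from the accompanying CD; ObjectCache, Acquire(), Release(), Available(), and Particle are all hypothetical names. The sketch assumes a fixed capacity chosen up front and a C++11 or later compiler, and it preallocates every object in one contiguous array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size cache: every object is default-constructed once,
// then handed out and returned without further construction or destruction.
template <typename T>
class ObjectCache
{
public:
    explicit ObjectCache(std::size_t capacity)
        : mObjects(capacity)
    {
        mFree.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            mFree.push_back(&mObjects[i]);
    }

    // Returns a cached object, or a null pointer if the cache is exhausted.
    T *Acquire()
    {
        if (mFree.empty())
            return nullptr;
        T *pObj = mFree.back();
        mFree.pop_back();
        return pObj;
    }

    // Returns an object to the cache. The object keeps whatever state it
    // had; the next user is expected to reinitialize it.
    void Release(T *pObj)
    {
        assert(pObj != nullptr);
        mFree.push_back(pObj);
    }

    std::size_t Available() const { return mFree.size(); }

private:
    // Declared but not implemented, to forbid copying: mFree holds
    // pointers into mObjects, so a copied cache would be corrupt.
    ObjectCache(const ObjectCache &);
    ObjectCache &operator=(const ObjectCache &);

    std::vector<T> mObjects;  // storage, constructed once up front
    std::vector<T *> mFree;   // objects currently available for reuse
};

// Example payload type; the fields are placeholders.
struct Particle
{
    Particle() : x(0.0f), y(0.0f), life(0.0f) {}
    float x, y, life;
};
```

A particle system might hold an ObjectCache<Particle> sized for the worst case, call Acquire() when an emitter fires, and call Release() when a particle dies. Returning a null pointer on exhaustion is one possible policy; a real cache might grow or assert instead.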


Memory Management

C++ applications generally need to be more aware of the details of memory management than C applications do. In C, all allocations are explicit through malloc() and free(), while C++ can implicitly allocate memory while constructing temporary objects and member variables. Most C++ games (like most C games) will require their own memory manager.

Because a C++ game is likely to perform many allocations, it must be especially careful about fragmenting the heap. One option is to take one of the traditional approaches: either don't allocate any memory at all after the game starts up, or maintain a large contiguous block of memory that is periodically freed (between levels, for example). On modern machines, such draconian measures are not necessary, if you're willing to be vigilant about your memory usage.

The first step is to override the global new and delete operators. Use custom implementations of these operators to redirect the game's most common allocations away from malloc() and into preallocated blocks of memory. For example, if you find that you have at most 10,000 4-byte allocations outstanding at any one time, you should allocate 40,000 bytes up front and issue blocks out as necessary. To keep track of which blocks are free, maintain a free list by pointing each free block to the next free block. On allocation, remove the front block from the list, and on deallocation, add the freed block to the front again. Figure 1.1.1 illustrates how the free list of small blocks might wind its way through a contiguous larger block after a sequence of allocations and frees.

[Figure: a contiguous block of memory divided into small cells labeled used or free]
FIGURE 1.1.1 A linked free list.

You'll typically find that a game has many small, short-lived allocations, and thus you'll want to reserve space for many small blocks. Reserving many larger blocks wastes a substantial amount of memory for those blocks that are not currently in use; above a certain size, you'll want to pass allocations off to a separate large block allocator, or just to malloc().
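The free-list scheme described in this section can be sketched as a small pool class. Everything here is an assumption made for illustration: the name BlockPool, its interface, and the decision to round each block up to the size of a pointer so that a free block can hold the link to the next free block. Hooking such a pool into the global new and delete operators is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool of equally sized blocks. The first bytes of each free
// block store a pointer to the next free block, so the free list itself
// needs no extra memory.
class BlockPool
{
public:
    BlockPool(std::size_t blockSize, std::size_t blockCount)
        : mWordsPerBlock(WordsFor(blockSize)),
          mStorage(mWordsPerBlock * blockCount),
          mpFreeList(nullptr)
    {
        // Push the blocks last to first, so the first allocation returns
        // the lowest address.
        for (std::size_t i = blockCount; i > 0; --i)
            Free(&mStorage[(i - 1) * mWordsPerBlock]);
    }

    // Removes the front block from the free list. Returns a null pointer
    // when the pool is exhausted; a caller could then fall back to malloc().
    void *Allocate()
    {
        if (mpFreeList == nullptr)
            return nullptr;
        void *pBlock = mpFreeList;
        mpFreeList = *static_cast<void **>(pBlock);
        return pBlock;
    }

    // Adds a block obtained from Allocate() to the front of the free list.
    void Free(void *pBlock)
    {
        assert(pBlock != nullptr);
        *static_cast<void **>(pBlock) = mpFreeList;
        mpFreeList = pBlock;
    }

private:
    // Declared but not implemented, to forbid copying.
    BlockPool(const BlockPool &);
    BlockPool &operator=(const BlockPool &);

    // Number of pointer-sized words needed to hold one block (at least one).
    static std::size_t WordsFor(std::size_t bytes)
    {
        const std::size_t words = (bytes + sizeof(void *) - 1) / sizeof(void *);
        return words == 0 ? 1 : words;
    }

    std::size_t mWordsPerBlock;
    std::vector<void *> mStorage;  // one contiguous, pointer-aligned block
    void *mpFreeList;              // front of the free list
};
```

Both Allocate() and Free() touch only the front of the list, so each is a handful of instructions. Note that on a platform with 8-byte pointers, this sketch rounds a 4-byte request up to 8 bytes so that the link fits inside the block.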

Virtual Functions
Critics of C++ in games often point to virtual functions as a mysterious feature that drains performance. Conceptually, the mechanism is simple. To generate a virtual function call on an object, the compiler accesses the object's virtual function table, retrieves a pointer to the member function, sets up the call, and jumps to the member function's address. This is to be compared with a function call in C, where the compiler sets up the call and jumps to a fixed address. The extra overhead for the virtual function call is the indirection to the virtual function table; because the address of the call isn't known in advance, there can also be a penalty for missing the processor's instruction cache.

Any substantial C++ program will make heavy use of virtual functions, so the idea is to avoid these calls in performance-critical areas. Here is a typical example:

class BaseClass
{
public:
    virtual char *GetPointer() = 0;
};

class Class1 : public BaseClass
{
    virtual char *GetPointer();
};

class Class2 : public BaseClass
{
    virtual char *GetPointer();
};

void Function(BaseClass *pObj)
{
    char *ptr = pObj->GetPointer();
}

If Function() is performance critical, we want to change the call to GetPointer from virtual to inline. One way to do this is to add a new data member to BaseClass, which is returned by an inline version of GetPointer(), and set the data member in each class:

class BaseClass
{
public:
    inline char *GetPointerFast()
    {
        return mpData;
    }
protected:
    inline void SetPointer(char *pData)
    {
        mpData = pData;
    }
private:
    char *mpData;
};

// Class1 and Class2 call SetPointer as necessary
// in member functions

void Function(BaseClass *pObj)
{
    char *ptr = pObj->GetPointerFast();
}

A more drastic measure is to rearrange your class hierarchy. If Class1 and Class2 have only slight differences, it might be worth combining them into a single class, with a flag indicating whether you want the class to behave like Class1 or Class2 at runtime. With this change (and the removal of the pure virtual BaseClass), the GetPointer function in the previous example can again be made inline. This transformation is far from elegant, but in inner loops on machines with small caches, you'd be willing to do much worse to get rid of a virtual function call.

Although each new virtual function adds only the size of a pointer to a per-class table (usually a negligible cost), the first virtual function in a class requires a pointer to the virtual function table on a per-object basis. This means that you don't want to have any virtual functions at all in small, frequently used classes where this extra overhead is unacceptable. Because inheritance generally requires the use of one or more virtual functions (a virtual destructor if nothing else), you don't want any hierarchy for small, heavily used objects.

Code Size
Compilers have a somewhat deserved reputation for generating bloated code for C++. Because memory is limited, and because small is fast, it's important to make your executable as small as possible. The first thing to do is get the compiler on your side. If your compiler stores debugging information in the executable, disable the generation of debugging information. (Note that Microsoft Visual C++ stores debugging information separate from the executable, so this may not be necessary.) Exception handling generates extra code; get rid of as much exception-generating code as possible. Make sure the linker is configured to strip out unused functions and classes. Enable the compiler's highest level of optimization, and try setting it to optimize for size instead of speed; sometimes this actually produces faster code because of better instruction cache coherency. (Be sure to verify that intrinsic functions are still enabled if you use this setting.) Get rid of all of your space-wasting strings in debugging print statements, and have the compiler combine duplicate constant strings into single instances.

Inlining is often the culprit behind suspiciously large functions. Compilers are free to respect or ignore your inline keywords, and they may well inline functions without telling you. This is another reason to keep your constructors lightweight, so that objects on the stack don't wind up generating lots of inline code. Also be careful of overloaded operators; a simple expression like m1 = m2 * m3 can generate a ton of inline code if m2 and m3 are matrices. Get to know your compiler's settings for inlining functions thoroughly.

Enabling runtime type information (RTTI) requires the compiler to generate some static information for (just about) every class in your program. RTTI is typically enabled so that code can call dynamic_cast and determine an object's type. Consider avoiding RTTI and dynamic_cast entirely in order to save space (in addition, dynamic_cast is quite expensive in some implementations). Instead, when you really need to have different behavior based on type, add a virtual function that behaves differently. This is better object-oriented design anyway. (Note that this doesn't apply to static_cast, which is just like a C-style cast in performance.)
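To make the last suggestion concrete, here is a small sketch of replacing a type test with a virtual function. The classes GameObject, ArmoredTank, and GlassCannon and the function names are invented for this example. Instead of using dynamic_cast to discover what kind of object was hit, the caller asks the object itself.

```cpp
#include <cassert>

class GameObject
{
public:
    virtual ~GameObject() {}

    // Each type supplies its own behavior, so callers never need to ask
    // what the concrete type is.
    virtual int DamageFromHit(int baseDamage) const { return baseDamage; }
};

class ArmoredTank : public GameObject
{
public:
    // Armor halves incoming damage.
    virtual int DamageFromHit(int baseDamage) const { return baseDamage / 2; }
};

class GlassCannon : public GameObject
{
public:
    // Fragile objects take double damage.
    virtual int DamageFromHit(int baseDamage) const { return baseDamage * 2; }
};

// No dynamic_cast and no RTTI: a single virtual call selects the behavior.
int ApplyHit(const GameObject &target, int baseDamage)
{
    return target.DamageFromHit(baseDamage);
}
```

This does cost one virtual call per hit, so the earlier advice still applies: it suits code that runs occasionally, not the innermost loops.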

The Standard Template Library


The Standard Template Library (STL) is a set of templates that implement common data structures and algorithms, such as dynamic arrays (called vectors), sets, and maps. Using the STL can save you a great deal of time that you'd otherwise spend writing and debugging these containers yourself. Once again, though, you need to be aware of the details of your STL implementation if you want maximum efficiency.

In order to allow the maximum range of implementations, the STL standard is silent in the area of memory allocation. Each operation on an STL container has certain performance guarantees; for example, insertion into a set takes O(log n) time. However, there are no guarantees on a container's memory usage.

Let's go into detail on a very common problem in game development: you want to store a bunch of objects (we'll call it a list of objects, though we won't necessarily store it in an STL list). Usually you want each object to appear in the list only once, so that you don't have to worry about accidentally inserting the object into the collection if it's already there. An STL set ignores duplicates and has O(log n) insertion, deletion, and lookup. The perfect choice, right?

Maybe. While it's true that most operations on a set are O(log n), this notation hides a potentially large constant. Although the collection's memory usage is implementation dependent, many implementations are based on a red-black tree, where each node of the tree stores an element of the collection. It's common practice to allocate a node of the tree every time an element is inserted, and to free a node every time an element is removed. Depending on how often you insert and remove elements, the time spent in the memory allocator can overshadow any algorithmic savings you gained from using a set.

An alternative solution uses an STL vector to store elements. A vector is guaranteed to have amortized constant-time insertion at the end of the collection. What this means in practice is that a vector typically reallocates memory only on occasion, say, doubling its size whenever it's full. When using a vector to store a list of unique elements, you first check the vector to see if the element is already there, and if it isn't, you add it to the back. Checking the entire vector will take O(n) time, but the constant involved is likely to be small. That's because all of the elements of a vector are typically stored contiguously in memory, so checking the entire vector is a cache-friendly operation. Checking an entire set may well thrash the memory cache, as individual elements of the red-black tree could be scattered all over memory. Also consider that a set must maintain a significant amount of overhead to set up the tree. If all you're storing is object pointers, a set can easily require three to four times the memory of a vector to store the same objects.

Deletion from a set is O(log n), which seems fast until you consider that it probably also involves a call to free(). Deletion from a vector is O(n), because everything from the deleted element to the end of the vector must be copied over one position. However, if the elements of the vector are just pointers, the copying can all be done in a single call to memmove() (memcpy() is not guaranteed to work when the source and destination overlap), which is typically very fast. (This is one reason why it's usually preferable to store pointers to objects in STL collections, as opposed to objects themselves. If you store objects directly, many extra constructors get invoked during operations such as deletion.)

If you're still not convinced that sets and maps can often be more trouble than they're worth, consider the cost of iterating over a collection, specifically:

for (Collection::iterator it = collection.begin(); it != collection.end(); ++it)

If Collection is a vector, then ++it is a pointer increment: one machine instruction. But when Collection is a set or a map, ++it involves traversing to the next node of a red-black tree, a relatively complicated operation that is also much more likely to cause a cache miss, because tree nodes may be scattered all over memory.

Of course, if you're storing a very large number of items in a collection, and doing lots of membership queries, a set's O(log n) performance could very well be worth the memory cost. Similarly, if you're only using the collection infrequently, the performance difference may be irrelevant. You should do performance measurements to determine what values of n make a set faster. You may be surprised to find that vectors outperform sets for all values that your game will typically use.

That's not quite the last word on STL memory usage, however. It's important to know if a collection actually frees its memory when you call the clear() method. If not, memory fragmentation can result. For example, if you start a game with an empty vector, add elements to the vector as the game progresses, and then call clear() when the player restarts, the vector may not actually free its memory at all. The empty vector's memory could still be taking up space somewhere in the heap, fragmenting it. There are two ways around this problem, if indeed your implementation works this way. First, you can call reserve() when the vector is created, reserving enough space for the maximum number of elements that you'll ever need. If that's impractical, you can explicitly force the vector to free its memory this way:

vector<int> v;
// ... elements are inserted into v here
vector<int>().swap(v); // causes v to free its memory


Sets, lists, and maps typically don't have this problem, because they allocate and free each element separately.
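The vector-based collection of unique elements described in this section might be wrapped in a few helper functions, as in the sketch below. The helper names AddUnique, RemoveValue, and ReleaseMemory are hypothetical. The search uses std::find, which performs the linear scan discussed above, and ReleaseMemory applies the swap idiom shown in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Appends value only if it is not already present (a linear O(n) scan).
// Returns true if the value was added.
template <typename T>
bool AddUnique(std::vector<T> &items, const T &value)
{
    if (std::find(items.begin(), items.end(), value) != items.end())
        return false;
    items.push_back(value);
    return true;
}

// Removes value if present, shifting later elements down one position.
// Returns true if something was removed.
template <typename T>
bool RemoveValue(std::vector<T> &items, const T &value)
{
    typename std::vector<T>::iterator it =
        std::find(items.begin(), items.end(), value);
    if (it == items.end())
        return false;
    items.erase(it);
    return true;
}

// Swaps with an empty temporary so that the old storage is released when
// the temporary is destroyed.
template <typename T>
void ReleaseMemory(std::vector<T> &items)
{
    std::vector<T>().swap(items);
}
```

In a game these would usually be instantiated with pointer types, in line with the advice to store pointers instead of objects. Whether the scan beats a set for your values of n is something only a profiler can tell you.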

Advanced Features
Just because a language has a feature doesn't mean you have to use it. Seemingly simple features can have very poor performance, while other seemingly complicated features can in fact perform well. The darkest corners of C++ are highly compiler dependent; make sure you know the costs before using them.

C++ strings are an example of a feature that sounds great on paper, but should be avoided where performance matters. Consider the following code:

void Function(const std::string &str)
{
}

Function("hello");

The call to Function() invokes a constructor for a string given a const char *. In one commercial implementation, this constructor performs a malloc(), a strlen(), and a memcpy(), and the destructor immediately does some nontrivial work (because this implementation's strings are reference counted) followed by a free(). The memory that's allocated is basically a waste, because the string "hello" is already in the program's data segment; we've effectively duplicated it in memory. If Function() had instead been declared as taking a const char *, there would be no overhead to the call. That's a high price to pay for the convenience of manipulating strings.

Templates are an example of the opposite extreme of efficiency. According to the language standard, the compiler generates code for a template when the template is instantiated with a particular type. In theory, it sounds like a single template declaration would lead to massive amounts of nearly identical code. If you have a vector of Class1 pointers, and a vector of Class2 pointers, you'll wind up with two copies of vector in your executable.

The reality for most compilers is usually better. First, only template member functions that are actually called have any code generated for them. Second, the compiler is allowed to generate only one copy of the code, if correct behavior is preserved. You'll generally find that in the vector example given previously, only a single copy of code (probably for vector<void *>) will be generated. Given a good compiler, templates give you all the convenience of generic programming, while maintaining high performance.

Some features of C++, such as initializer lists and preincrement, generally increase performance, while other features such as overloaded operators and RTTI look equally innocent but carry serious performance penalties. STL collections illustrate how blindly trusting in a function's documented algorithmic running time can lead you astray. Avoid the potentially slow features of the language and libraries, and spend some time becoming familiar with the options in your profiler and compiler. You'll quickly learn to design for speed and hunt down the performance problems in your game.
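Returning to the string example earlier in this section, one possible arrangement is to make the const char * version the primary function and offer a thin overload for callers that already hold a std::string. The function name IsGreeting and its behavior are invented for this sketch; only the overloading pattern matters.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Primary version: works directly on the caller's characters, so passing a
// string literal creates no std::string temporary.
bool IsGreeting(const char *str)
{
    return std::strcmp(str, "hello") == 0;
}

// Convenience overload for callers that already have a std::string.
// It forwards to the version above without copying the characters.
inline bool IsGreeting(const std::string &str)
{
    return IsGreeting(str.c_str());
}
```

A call such as IsGreeting("hello") selects the const char * overload, because converting a literal to a pointer is a better match than constructing a std::string. The temporary described in the text is therefore never created.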

Further Investigations
Thanks to Pete Isensee and Christopher Kirmse for reviewing this gem.

Cormen, Thomas, Charles Leiserson, and Ronald Rivest, Introduction to Algorithms, Cambridge, Massachusetts: MIT Press, 1990.
Isensee, Peter, C++ Optimization Strategies and Techniques, www.tantalon.com/pete/cppopt/main.htm.
Koenig, Andrew, "Pre- or Postfix Increment," The C++ Report, June 1999.
Meyers, Scott, Effective C++, Second Edition, Reading, Massachusetts: Addison-Wesley Publishing Co., 1998.
Sutter, Herb, Guru of the Week #54: Using Vector and Deque, www.gotw.ca/gotw/054.htm.

1.2
Inline Functions Versus Macros
Peter Dalton, Evans & Sutherland
[email protected]

When it comes to game programming, the need for fast, efficient functions cannot be overstated, especially functions that are executed multiple times per frame. Many programmers rely heavily on macros when dealing with common, time-critical routines because they eliminate the calling/returning sequence required by functions that are sensitive to the overhead of function calls. However, using the #define directive to implement macros that look like functions is more problematic than it is worth.

Advantages of Inline Functions


Through the use of inline functions, many of the inherent disadvantages of macros can easily be avoided. Take, for example, the following macro definition:
#define max(a,b) ( ( a ) > (b) ? (a) : (b))

Let's look at what would happen if we called the macro with the following parameters: max(++x, y). If x = 5 and y = 3, the macro will return a value of 7 rather than the expected value of 6. This illustrates the most common side effect of macros, the fact that expressions passed as arguments can be evaluated more than once. To avoid this problem, we could have used an inline function to accomplish the same goal:
inline int max(int a, int b) { return (a > b ? a : b); }

By using the inline method, we are guaranteed that all parameters will only be evaluated once, because inline functions must, by definition, follow all the protocols and type safety enforced on normal functions. Another problem that plagues macros, operator precedence, has the same root cause and is illustrated in the following macro:
#define square(x) (x*x)

If we were to call this macro with the expression 2+1, it should become obvious that the macro would return a result of 5 instead of the expected 9. The problem here is that the multiplication operator has a higher precedence than the addition operator has. While wrapping all of the expressions within parentheses would remedy this problem, it could have easily been avoided through the use of inline functions.

The other major pitfall surrounding macros has to do with multiple-statement macros, and guaranteeing that all statements within the macro are executed properly. Again, let's look at a simple macro used to clamp any given number between zero and one:

#define clamp(a) \
    if (a > 1.0) a = 1.0; \
    if (a < 0.0) a = 0.0;

If we were to use the macro within the following loop:

for (int ii = 0; ii < N; ++ii)
    clamp( numbersToBeClamped[ii] );

the numbers would not be clamped if they were less than zero. Only after the for loop terminates, when ii == N, would the expression if (numbersToBeClamped[ii] < 0.0) be evaluated (assuming a compiler, such as Visual C++ 6.0, that leaves ii in scope after the loop; a standard-conforming compiler rejects the code instead). This is also very problematic, because the index variable is now out of range and could easily result in a memory bounds violation that could crash the program. While replacing the macro with an inline function to perform the same functionality is not the only solution, it is the cleanest.

Given these inherent disadvantages associated with macros, let's run through the advantages of inline functions:

• Inline functions follow all the protocols of type safety enforced on normal functions. This ensures that unexpected or invalid parameters are not passed as arguments.
• Inline functions are specified using the same syntax as any other function, except for the inline keyword in the function declaration.
• Expressions passed as arguments to inline functions are evaluated prior to entering the function body; thus, expressions are evaluated only once. As shown previously, expressions passed to macros can be evaluated more than once and may result in unsafe and unexpected side effects.
• It is possible to debug inline functions using debuggers such as Microsoft's Visual C++. This is not possible with macros because the macro is expanded before the parser takes over and the program's symbol tables are created.
• Inline functions arguably increase the procedure's readability and maintainability because they use the same syntax as regular function calls, yet do not modify parameters unexpectedly.

Inline functions also outperform ordinary functions by eliminating the overhead of function calls. This includes tasks such as stack-frame setup, parameter passing, stack-frame restoration, and the returning sequence. Besides these key advantages, inline functions also provide the compiler with the ability to perform improved code optimizations. By replacing inline functions with code, the inserted code is subject to additional optimizations that would not otherwise be possible, because most compilers do not perform interprocedural optimizations. Allowing the compiler to perform global optimizations such as common subexpression elimination and loop invariant removal can dramatically improve both speed and size.

The only limitation to inline functions that is not present within macros is the restriction on parameter types. Macros allow for any possible type to be passed as a parameter; however, inline functions only allow for the specified parameter type in order to enforce type safety. We can overcome this limitation through the use of inline template functions, which allow us to accept any parameter type and enforce type safety, yet still provide all the benefits associated with inline functions.
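The pitfalls above, and the inline template alternative, can be compared side by side in a short sketch. The macros are the ones from the text, renamed MACRO_MAX and MACRO_SQUARE here so that they do not collide with library names. InlineMax, InlineSquare, and InlineClamp are hypothetical names for the template replacements.

```cpp
#include <cassert>

// The macros from the text, renamed to avoid clashing with library names.
#define MACRO_MAX(a, b) ((a) > (b) ? (a) : (b))
#define MACRO_SQUARE(x) (x * x)

// Arguments are evaluated exactly once, and both must have the same type.
template <typename T>
inline T InlineMax(T a, T b)
{
    return a > b ? a : b;
}

// The argument is a value, not pasted text, so precedence cannot go wrong.
template <typename T>
inline T InlineSquare(T x)
{
    return x * x;
}

// Clamps a value to the range [0, 1]. Being a single function call, it is
// safe to use as the sole statement of a loop body.
template <typename T>
inline void InlineClamp(T &a)
{
    if (a > T(1))
        a = T(1);
    if (a < T(0))
        a = T(0);
}
```

With x = 5 and y = 3, MACRO_MAX(++x, y) yields 7 and leaves x at 7, while InlineMax(++x, y) yields 6. Likewise MACRO_SQUARE(2 + 1) yields 5, while InlineSquare(2 + 1) yields 9.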

When to Use Inline Functions



Why don't we make every function an inline function? Wouldn't this eliminate the function overhead for the entire program, resulting in faster fill rates and response times? Obviously, the answer to these questions is no. While code expansion can improve speed by eliminating function overhead and allowing for interprocedural compiler optimizations, this is all done at the expense of code size. When examining the performance of a program, two factors need to be weighed: execution speed and the actual code size. Increasing code size takes up more memory, which is a precious commodity, and also bogs down the execution speed. As the memory requirements for a program increase, so does the likelihood of cache misses and page faults. While a cache miss will cause a minor delay, a page fault will always result in a major delay because the virtual memory location is not in physical memory and must be fetched from disk. On a Pentium II 400 MHz desktop machine, a hard page fault will result in an approximately 10 millisecond penalty, or about 4,000,000 CPU cycles [Heller99]. If inline functions are not always a win, then when exactly should we use them? The answer to this question really depends on the situation and thus must rely heavily on the judgment of the programmer. However, here are some guidelines for when inline functions work well: Small methods, such as accessors for private data members. Functions returning state information about an object. Small functions, typically three lines or less. Small functions that are called repeatedly; for example, within a time-critical rendering loop.

Longer functions that spend proportionately less time in the calling/returning sequence will benefit less from inlining. However, used correctly, inlining can greatly increase procedure performance.
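As a minimal sketch of these guidelines, assume a hypothetical Particle class (the class, its members, and AdvanceAll are illustrative names, not taken from this gem). The one-line accessors and the tiny per-frame update are defined in the class body, which makes them implicitly inline, while the longer loop is left as an ordinary function because its call overhead is small relative to its body:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle type, used only for illustration.
class Particle
{
public:
    Particle(float x, float speed) : mX(x), mSpeed(speed) {}

    // Small accessors: defined in the class body, so implicitly inline.
    float GetX() const { return mX; }
    float GetSpeed() const { return mSpeed; }

    // Tiny state change that a time-critical loop calls repeatedly.
    void Advance(float dt) { mX += mSpeed * dt; }

private:
    float mX;
    float mSpeed;
};

// Longer function: the call/return sequence is a small share of its cost,
// so it gains little from inlining and stays an ordinary function.
float AdvanceAll(std::vector<Particle>& particles, float dt)
{
    float furthest = 0.0f;
    for (std::size_t i = 0; i < particles.size(); ++i)
    {
        particles[i].Advance(dt);
        if (particles[i].GetX() > furthest)
            furthest = particles[i].GetX();
    }
    return furthest;
}
```

Whether the compiler actually inlines these calls is still its decision, so the profiler remains the final judge.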


When to Use Macros


Despite the problems associated with macros, there are a few circumstances in which they are invaluable. For example, macros can be used to create small pseudo-languages that can be quite powerful. A set of macros can provide the framework that makes creating state machines a breeze, while being very debuggable and bulletproof. For an excellent example of this technique, refer to the "Designing a General Robust AI Engine" article referenced at the end of this gem [Rabin00]. Another example might be printing enumerated types to the screen. For example:

#define CaseEnum(a) case(a) : PrintEnum( #a )

switch (msg_passed_in)
{
    CaseEnum( MSG_YouWereHit );
        ReactToHit();
        break;
    CaseEnum( MSG_GameReset );
        ResetGameLogic();
        break;
}

Here, PrintEnum() is a macro that prints a string to the screen. The # is the stringizing operator that converts macro parameters to string constants [MSDN]. Thus, there is no need to create a look-up table of all enums to strings (which are usually poorly maintained) in order to retrieve invaluable debug information. The key to avoiding the problems associated with macros is, first, to understand the problems, and, second, to know the alternative implementations.
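The same stringizing trick can be sketched as a self-contained name lookup. The macro CASE_ENUM_NAME and the function MessageName are hypothetical names introduced here for illustration; only the two message constants come from the example above:

```cpp
#include <cassert>
#include <cstring>

enum Message { MSG_YouWereHit, MSG_GameReset };

// Expands to:  case (MSG_X): return "MSG_X"
// The # operator turns the macro argument into a string literal.
#define CASE_ENUM_NAME(a) case (a): return #a

// Returns the enumerator's own spelling, with no hand-maintained table.
const char* MessageName(Message msg)
{
    switch (msg)
    {
        CASE_ENUM_NAME(MSG_YouWereHit);
        CASE_ENUM_NAME(MSG_GameReset);
    }
    return "unknown";
}
```

Because the string is generated from the enumerator itself, renaming a message cannot leave a stale name behind in the debug output.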

Microsoft Specifics
Besides the standard inline keyword, Microsoft's Visual C++ compiler provides support for two additional keywords. The __inline keyword instructs the compiler to generate a cost/benefit analysis and to only inline the function if it proves beneficial. The __forceinline keyword instructs the compiler to always inline the function. Despite using these keywords, there are certain circumstances in which the compiler cannot comply, as noted by Microsoft's documentation [MSDN].
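Since __forceinline is specific to Visual C++, one common way to use it without tying the code to a single compiler is to hide it behind a project macro. The following is only a sketch: GAME_FORCEINLINE and Lerp are hypothetical names, and the GCC/Clang branch (the always_inline attribute) is an assumption added for portability rather than something this gem discusses:

```cpp
#include <cassert>

// GAME_FORCEINLINE is a hypothetical project macro, not a compiler keyword.
#if defined(_MSC_VER)
    #define GAME_FORCEINLINE __forceinline
#elif defined(__GNUC__)
    #define GAME_FORCEINLINE inline __attribute__((always_inline))
#else
    #define GAME_FORCEINLINE inline   // fall back to a plain inline request
#endif

// Small, hot helper of the kind worth forcing inline.
GAME_FORCEINLINE float Lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}
```

Even with such a macro, the compiler may still decline in the circumstances its documentation lists, so the effect should be confirmed in the generated code.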

References
[Heller99] Heller, Martin, Developing Optimized Code with Microsoft Visual C++ 6.0, Microsoft MSDN Library, January 2000.
[McConnell93] McConnell, Steve, Code Complete, Microsoft Press, 1993.
[MSDN] Microsoft Developer Network Library, http://msdn.microsoft.com.
[Myers98] Meyers, Scott, Effective C++, Second Edition, Addison-Wesley Longman, Inc., 1998.
[Rabin00] Rabin, Steve, "Designing a General Robust AI Engine," Game Programming Gems, Charles River Media, 2000, pp. 221-236.

1.3
Programming with Abstract Interfaces
Noel Llopis, Meyer/Glass Interactive
[email protected]

The concept of abstract interfaces is simple yet powerful. It allows us to completely separate the interface from its implementation. This has some very useful consequences:

• It is easy to switch among different implementations for the code without affecting the rest of the game. This is particularly useful when experimenting with different algorithms, or for changing implementations on different platforms.
• The implementations can be changed at runtime. For example, if the graphics renderer is implemented through an abstract interface, it is possible to choose between a software renderer or a hardware-accelerated one while the game is running.
• The implementation details are completely hidden from the user of the interface. This will result in fewer header files included all over the project, faster recompile times, and fewer times when the whole project needs to be completely recompiled.
• New implementations of existing interfaces can be added to the game effortlessly, and potentially even after it has been compiled and released. This makes it possible to easily extend the game by providing updates or user-defined modifications.

Abstract Interfaces
In C++, an abstract interface is nothing more than a base class that has only public pure virtual functions. A pure virtual function is a type of virtual member function that has no implementation. Any derived class must implement those functions, or else the compiler prevents instantiation of that class. Pure virtual functions are indicated by adding = 0 after their declaration. The following is an example of an abstract interface for a minimal sound system. This interface would be declared in a header file by itself:

// In SoundSystem.h
class ISoundSystem
{
public:
    virtual ~ISoundSystem() {}
    virtual bool PlaySound ( handle hSound ) = 0;
    virtual bool StopSound ( handle hSound ) = 0;
};

The abstract interface provides no implementation whatsoever. All it does is define the rules by which the rest of the world may use the sound system. As long as the users of the interface know about ISoundSystem, they can use any sound system implementation we provide. The following header file shows an example of an implementation of the previous interface:

// In SoundSystemSoftware.h
#include "SoundSystem.h"

class SoundSystemSoftware : public ISoundSystem
{
public:
    virtual ~SoundSystemSoftware ();
    virtual bool PlaySound ( handle hSound );
    virtual bool StopSound ( handle hSound );
    // The rest of the functions in the implementation
};

We would obviously need to provide the actual implementation for each of those functions in the corresponding .cpp file. To use this class, you would have to do the following:
ISoundSystem * pSoundSystem = new SoundSystemSoftware();
// Now we're ready to use it
pSoundSystem->PlaySound ( hSound );

So, what have we accomplished by creating our sound system in this roundabout way? Almost everything that we promised at the start:

• It is easy to create another implementation of the sound system (maybe a hardware version). All that is needed is to create a new class that inherits from ISoundSystem, instantiate it instead of SoundSystemSoftware, and everything else will work the same way without any more changes.
• We can switch between the two classes at runtime. As long as pSoundSystem points to a valid object, the rest of the program doesn't know which one it is using, so we can change them at will. Obviously, we have to be careful with specific class restrictions. For example, some classes will keep some state information or require initialization before being used for the first time.
• We have hidden all the implementation details from the user. By implementing the interface we are committed to providing the documented behavior no matter what our implementation is. The code is much cleaner than the equivalent code full of if statements checking for one type of sound system or another. Maintaining the code is also much easier.
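The runtime switch described above can be sketched in a single file. Several details here are assumptions made for the demonstration: handle is taken to be a plain int (the gem does not define it), SoundSystemNull and PlayExplosion are hypothetical names, and the implementations only record what they were asked to do instead of producing audio:

```cpp
#include <cassert>

typedef int handle;   // assumed handle type; the text does not define it

class ISoundSystem
{
public:
    virtual ~ISoundSystem() {}
    virtual bool PlaySound(handle hSound) = 0;
    virtual bool StopSound(handle hSound) = 0;
};

// Stand-in "software" mixer: remembers the last sound it was asked to play.
class SoundSystemSoftware : public ISoundSystem
{
public:
    SoundSystemSoftware() : mLastPlayed(-1) {}
    virtual bool PlaySound(handle hSound) { mLastPlayed = hSound; return true; }
    virtual bool StopSound(handle) { mLastPlayed = -1; return true; }
    handle LastPlayed() const { return mLastPlayed; }
private:
    handle mLastPlayed;
};

// Hypothetical silent implementation, e.g. for a machine without audio.
class SoundSystemNull : public ISoundSystem
{
public:
    virtual bool PlaySound(handle) { return false; }
    virtual bool StopSound(handle) { return false; }
};

// Game code sees only the interface and never names a concrete class.
bool PlayExplosion(ISoundSystem* pSoundSystem)
{
    const handle hExplosion = 7;   // arbitrary handle for the demo
    return pSoundSystem->PlaySound(hExplosion);
}
```

Pointing pSoundSystem at a different object is all it takes to change behavior; PlayExplosion contains no if statement about which sound system is active.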

Adding a Factory
There is one detail that we haven't covered yet: we haven't completely hidden the specific implementations from the users. After all, the users are still doing a new on the class of the specific implementation they want to use. The problem with this is that they need to #include the header file with the declaration of the implementation. Unfortunately, the way C++ was designed, when users #include a header file, they can also get a lot of extra information on the implementation details of that class that they should know nothing about. They will see all the private and protected members, and they might even include extra header files that are only used in the implementation of the class.

To make matters worse, the users of the interface now know exactly what type of class their interface pointer points to, and they could be tempted to cast it to its real type to access some "special features" or rely on some implementation-specific behavior. As soon as this happens, we lose many of the benefits we gained by structuring our design into abstract interfaces, so this is something that should be avoided as much as possible.

The solution is to use an abstract factory [Gamma95], which is a class whose sole purpose is to instantiate a specific implementation for an interface when asked for it. The following is an example of a basic factory for our sound system:

// In SoundSystemFactory.h
class ISoundSystem;

class SoundSystemFactory
{
public:
    enum SoundSystemType
    {
        SOUND_SOFTWARE,
        SOUND_HARDWARE,
        SOUND_SOMETHINGELSE
    };

    static ISoundSystem * CreateSoundSystem(SoundSystemType type);
};

// In SoundSystemFactory.cpp
#include <cstddef>   // NULL
#include "SoundSystemFactory.h"
#include "SoundSystemSoftware.h"
#include "SoundSystemHardware.h"
#include "SoundSystemSomethingElse.h"

ISoundSystem * SoundSystemFactory::CreateSoundSystem ( SoundSystemType type )
{
    ISoundSystem * pSystem;

    switch ( type )
    {
        case SOUND_SOFTWARE:
            pSystem = new SoundSystemSoftware();
            break;
        case SOUND_HARDWARE:
            pSystem = new SoundSystemHardware();
            break;
        case SOUND_SOMETHINGELSE:
            pSystem = new SoundSystemSomethingElse();
            break;
        default:
            pSystem = NULL;
    }

    return pSystem;
}

Now we have solved the problem. The user need only include SoundSystemFactory.h and SoundSystem.h. As a matter of fact, we don't even have to make the rest of the header files available. To use a specific sound system, the user can now write:

ISoundSystem * pSoundSystem;
pSoundSystem = SoundSystemFactory::CreateSoundSystem(SoundSystemFactory::SOUND_SOFTWARE);
// Now we're ready to use it
pSoundSystem->PlaySound ( hSound );

We need to always include a virtual destructor in our abstract interfaces. If we don't, C++ will automatically generate a nonvirtual destructor, and deleting an object through an interface pointer will not call the real destructor of our specific implementation (that is usually a hard bug to track down). Unlike normal member functions, a destructor cannot simply be declared pure virtual and left without a body, because the destructors of derived classes always call it, so we provide an empty function to keep the compiler and linker happy.
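The effect of the virtual destructor can be checked with a small sketch. The counter gBuffersReleased and the function DestroySoundSystem are hypothetical names used only for this demonstration, and the handle parameter is simplified to an int:

```cpp
#include <cassert>

static int gBuffersReleased = 0;   // counts derived-destructor calls (demo only)

class ISoundSystem
{
public:
    virtual ~ISoundSystem() {}     // empty body, but virtual
    virtual bool PlaySound(int hSound) = 0;
};

class SoundSystemSoftware : public ISoundSystem
{
public:
    // Stands in for real cleanup such as freeing mixing buffers.
    virtual ~SoundSystemSoftware() { ++gBuffersReleased; }
    virtual bool PlaySound(int) { return true; }
};

// The caller only ever holds the interface pointer.
void DestroySoundSystem(ISoundSystem* pSystem)
{
    // Because ~ISoundSystem is virtual, this runs ~SoundSystemSoftware first.
    delete pSystem;
}
```

If the virtual keyword were removed from ~ISoundSystem, the same delete would no longer be guaranteed to run the derived destructor, which is exactly the hard-to-find bug the text warns about.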

Abstract Interfaces as Traits


A slightly different way to think of abstract interfaces is to consider an interface as a set of behaviors. If a class implements an interface, that class is making a promise that it will behave in certain ways. For example, the following is an interface used by objects that can be rendered to the screen:

class IRenderable
{
public:
    virtual ~IRenderable() {}
    virtual bool Render () = 0;
};

We can design a class to represent 3D objects that inherits from IRenderable and provides its own method to render itself on the screen. Similarly, we could have a terrain class that also inherits from IRenderable and provides a completely different rendering method.
class Generic3DObject : public IRenderable
{
public:
    virtual ~Generic3DObject();
    virtual bool Render();
    // Rest of the functions here
};

The render loop will iterate through all the objects, and if they can be rendered, it calls their Render() function. The real power of the interface comes again from hiding the real implementation behind the interface: now it is possible to add a completely new type of object, and as long as it presents the IRenderable interface, the rendering loop will be able to render it like any other object. Without abstract interfaces, the render loop would have to know about the specific types of object (generic 3D object, terrain, and so on) and decide whether to call their particular render functions. Creating a new type of render-capable object would require changing the render loop along with many other parts of the code.

We can check whether an object inherits from IRenderable to know if it can be rendered. Unfortunately, that requires that the compiler's RTTI (Run Time Type Identification) option be turned on when the code is compiled. There is usually a performance and memory cost to have RTTI enabled, so many games have it turned off in their projects. We could use our own custom RTTI, but instead, let's go the way of COM (Microsoft's Component Object Model) and provide a QueryInterface function [Rogerson97]. If the object in question implements a particular interface, then QueryInterface casts the incoming pointer to the interface and returns true.

To create our own QueryInterface function, we need to have a base class from which all of the related objects that inherit from a set of interfaces derive. We could even make that base class itself an interface like COM's IUnknown, but that makes things more complicated.
class GameObject
{
public:
    enum GameInterfaceType
    {
        IRENDERABLE,
        IOTHERINTERFACE
    };

    virtual bool QueryInterface (const GameInterfaceType type, void ** pObj );
    // The rest of the GameObject declaration
};

The implementation of QueryInterface for a plain game object would be trivial. Because it's not implementing any interface, it will always return false.

bool GameObject::QueryInterface (const GameInterfaceType type, void ** pObj )
{
    return false;
}

The implementation of a 3D object class is different from that of GameObject, because it will implement the IRenderable interface.

class Object3D : public GameObject, public IRenderable
{
public:
    virtual ~Object3D();
    virtual bool QueryInterface (const GameInterfaceType type, void ** pObj );
    virtual bool Render();
    // Some more functions if needed
};

bool Object3D::QueryInterface (const GameInterfaceType type, void ** pObj )
{
    bool bSuccess = false;
    if ( type == GameObject::IRENDERABLE )
    {
        *pObj = static_cast<IRenderable *>(this);
        bSuccess = true;
    }
    return bSuccess;
}

It is the responsibility of the Object3D class to override QueryInterface, check for what interfaces it supports, and do the appropriate casting. Now, let's look at the render loop, which is simple and flexible and knows nothing about the type of objects it is rendering.

IRenderable * pRenderable;
for ( all the objects we want to render )
{
    if ( pGameObject->QueryInterface (GameObject::IRENDERABLE, (void**)&pRenderable) )
    {
        pRenderable->Render();
    }
}

Now we're ready to deliver the last of the promises of abstract interfaces listed at the beginning of this gem: effortlessly adding new implementations. With such a render loop, if we give it new types of objects and some of them implement the IRenderable interface, everything will work as expected without the need to change the render loop. The easiest way to introduce the new object types would be to simply relink the project with the updated libraries or code that contains the new classes. Although beyond the scope of this gem, we could add new types of objects at runtime through DLLs or an equivalent mechanism available on the target platform. This enhancement would allow us to release new game objects or game updates without the need to patch the executable. Users could also use this method to easily create modifications for our game.

Notice that nothing is stopping us from inheriting from multiple interfaces. All it will mean is that the class that inherits from multiple interfaces is now providing all the services specified by each of the interfaces. For example, we could have an ICollidable interface for objects that need to have collision detection done. A 3D object could inherit from both IRenderable and ICollidable, but a class representing smoke would only inherit from IRenderable.

A word of warning, however: while using multiple abstract interfaces is a powerful technique, it can also lead to overly complicated designs that don't provide any advantages over designs with single inheritance. Also, multiple inheritance doesn't work well for dynamic characteristics, and should rather be used for permanent characteristics intrinsic to an object.

Even though many people advise staying away from multiple inheritance, this is a case where it is useful and it does not have any major drawbacks. Inheriting from at most one real parent class and multiple interfaces should not result in the dreaded diamond-shaped inheritance tree (where the parents of both our parents are the same class) or many of the other usual drawbacks of multiple inheritance.
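To see the pieces working together, here is a compact single-file sketch with two interfaces. The classes Crate, Smoke, and Trigger, the ICOLLIDABLE enumerator, and the RenderAll function are hypothetical names chosen for the demonstration. One deliberate difference from the loop shown earlier: the query result is received in a plain void* and then converted with static_cast, which avoids casting the address of a typed pointer to void**:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class IRenderable { public: virtual ~IRenderable() {} virtual bool Render() = 0; };
class ICollidable { public: virtual ~ICollidable() {} virtual bool Collide() = 0; };

class GameObject
{
public:
    enum GameInterfaceType { IRENDERABLE, ICOLLIDABLE };
    virtual ~GameObject() {}
    // A plain game object implements no interfaces.
    virtual bool QueryInterface(GameInterfaceType, void** pObj) { *pObj = 0; return false; }
};

// Solid object: can be drawn and collided with.
class Crate : public GameObject, public IRenderable, public ICollidable
{
public:
    virtual bool QueryInterface(GameInterfaceType type, void** pObj)
    {
        if (type == IRENDERABLE) { *pObj = static_cast<IRenderable*>(this); return true; }
        if (type == ICOLLIDABLE) { *pObj = static_cast<ICollidable*>(this); return true; }
        return GameObject::QueryInterface(type, pObj);
    }
    virtual bool Render() { return true; }
    virtual bool Collide() { return true; }
};

// Smoke is drawn but never collides.
class Smoke : public GameObject, public IRenderable
{
public:
    virtual bool QueryInterface(GameInterfaceType type, void** pObj)
    {
        if (type == IRENDERABLE) { *pObj = static_cast<IRenderable*>(this); return true; }
        return GameObject::QueryInterface(type, pObj);
    }
    virtual bool Render() { return true; }
};

// Invisible trigger volume: no interfaces at all.
class Trigger : public GameObject {};

// Render loop: knows only GameObject and IRenderable. Returns how many drew.
int RenderAll(const std::vector<GameObject*>& objects)
{
    int rendered = 0;
    for (std::size_t i = 0; i < objects.size(); ++i)
    {
        void* pInterface = 0;
        if (objects[i]->QueryInterface(GameObject::IRENDERABLE, &pInterface))
        {
            // Safe: the void* was produced from an IRenderable* above.
            IRenderable* pRenderable = static_cast<IRenderable*>(pInterface);
            if (pRenderable->Render())
                ++rendered;
        }
    }
    return rendered;
}
```

Adding a new renderable class later requires no change to RenderAll; the class only has to answer the IRENDERABLE query.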

Everything Has a Cost


So far, we have seen that abstract interfaces have many attractive features. However, all of these features come at a price. Most of the time, the advantages of using abstract interfaces outweigh any potential problems, but it is important to be aware of the drawbacks and limitations of this technique.

First, the design becomes more complex. For someone not used to abstract interfaces, the extra classes and the querying of interfaces could look confusing at first sight. They should only be used where they make a difference, not indiscriminately all over the game; otherwise, they will only obscure the design and get in the way.

With the abstract interfaces, we did such a good job hiding all of the private implementations that the code can actually become harder to debug. If all we have is a variable of type IRenderable*, we won't be able to see the private contents of the real object it points to in the debugger's interactive watch window without a lot of tedious casting. On the other hand, most of the time we shouldn't have to worry about it. Because the implementation is well isolated and tested by itself, all we should care about is using the interface correctly.

Another disadvantage is that it is not possible to extend an existing abstract interface through inheritance. Going back to our first example, maybe we would have liked to extend the SoundSystemHardware class to add a few functions specific to the game. Unfortunately, we don't have access to the class implementation any more, and we certainly can't inherit from it and extend it. It is still possible either to modify the existing interface or provide a new interface using a derived class, but it will all have to be done from the implementation side, and not from within the game code.

Finally, notice that every single function in an abstract interface is a virtual function. This means that every time one of these functions is called through the abstract interface, the computer will have to go through one extra level of indirection. This is typically not a problem with modern computers and game consoles, as long as we avoid using interfaces for functions that are called from within inner loops. For example, creating an interface with a DrawPolygon() or SetScreenPoint() function would probably not be a good idea.

Conclusion

Abstract interfaces are a powerful technique that can be put to good use with very little overhead or structural changes. It is important to know how they can be best used, and when it is better to do things a different way. Perfect candidates for abstract interfaces are modules that can be replaced (graphics renderers, spatial databases, AI behaviors), or any sort of pluggable or user-extendable modules (tool extensions, game behaviors).

References

[Gamma95] Gamma, Erich, et al., Design Patterns, Addison-Wesley, 1995.
[Lakos96] Lakos, John, Large Scale C++ Software Design, Addison-Wesley, 1996.
[Rogerson97] Rogerson, Dale, Inside COM, Microsoft Press, 1997.

1.4
Exporting C++ Classes from DLLs
Herb Marselas, Ensemble Studios
[email protected]

Exporting a C++ class from a Dynamic Link Library (DLL) for use by another application is an easy way to encapsulate instanced functionality or to share derivable functionality without having to share the source code of the exported class. This method is in some ways similar to Microsoft COM, but is lighter weight, easier to derive from, and provides a simpler interface.

Exporting a Function
At the most basic level, there is little difference between exporting a function or a class from a DLL. To export myExportedFunction from a DLL, the value _BUILDING_MY_DLL is defined in the preprocessor options of the DLL project, and not in the projects that use the DLL. This causes DLLFUNCTION to be replaced by __declspec(dllexport) when building the DLL, and __declspec(dllimport) when building the projects that use the DLL.

#ifdef _BUILDING_MY_DLL
#define DLLFUNCTION __declspec(dllexport) // defined if building the DLL
#else
#define DLLFUNCTION __declspec(dllimport) // defined if building the application
#endif

DLLFUNCTION long myExportedFunction(void);

Exporting a Class
Exporting a C++ class from a DLL is slightly more complicated because there are several alternatives. In the simplest case, the class itself is exported. As before, the DLLFUNCTION macro is used to declare the class exported by the DLL, or imported by the application.

#ifdef _BUILDING_MY_DLL
#define DLLFUNCTION __declspec(dllexport)
#else
#define DLLFUNCTION __declspec(dllimport)
#endif

class DLLFUNCTION CMyExportedClass
{
public:
    CMyExportedClass(void) : mdwValue(0) { }
    void setValue(long dwValue) { mdwValue = dwValue; }
    long getValue(void) { return mdwValue; }
    long clearValue(void);
private:
    long mdwValue;
};

If the DLL containing the class is implicitly linked (in other words, the project links with the DLL's lib file), then using the class is as simple as declaring an instance of the class CMyExportedClass. This also enables derivation from this class as if it were declared directly in the application. The declaration of a derived class in the application is made normally without any additional declarations.

class CMyApplicationClass : public CMyExportedClass
{
public:
    CMyApplicationClass ( void ) { }
};

There is one potential problem with declaring or allocating a class exported from a DLL in an application: it may confuse some memory-tracking programs and cause them to misreport memory allocations or deletions. To fix this problem, helper functions that allocate and destroy instances of the exported class must be added to the DLL. All users of the exported class should call the allocation function to create an instance of it, and the deletion function to destroy it. Of course, the drawback to this is that it prevents deriving from the exported class in the application. If deriving an application-side class from the exported class is important, and the project uses a memory-tracking program, then this program will either need to understand what's going on or be replaced by a new memory-tracking program.

#ifdef _BUILDING_MY_DLL
#define DLLFUNCTION __declspec(dllexport)
#else
#define DLLFUNCTION __declspec(dllimport)
#endif

class DLLFUNCTION CMyExportedClass
{
public:
    CMyExportedClass(void) : mdwValue(0) { }
    void setValue(long dwValue) { mdwValue = dwValue; }
    long getValue(void) { return mdwValue; }
    long clearValue(void);
private:
    long mdwValue;
};

// Helper functions that allocate and destroy instances inside the DLL
CMyExportedClass *createMyExportedClass(void) { return new CMyExportedClass; }
void deleteMyExportedClass(CMyExportedClass *pclass) { delete pclass; }
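On the application side, one way to make sure every instance created by the helper is also destroyed by the helper is to wrap the pair in a smart pointer. The sketch below assumes a C++11 compiler for std::unique_ptr; ExportedClassDeleter, ExportedClassPtr, and MakeExportedClass are hypothetical names. So that it builds as a single file, DLLFUNCTION is defined as empty here and the two helpers are compiled alongside the caller, whereas in a real project they would live in the DLL:

```cpp
#include <cassert>
#include <memory>

// In a real project DLLFUNCTION expands to __declspec(dllexport) or
// __declspec(dllimport); it is left empty so this sketch builds as one file.
#ifndef DLLFUNCTION
#define DLLFUNCTION
#endif

class DLLFUNCTION CMyExportedClass
{
public:
    CMyExportedClass() : mdwValue(0) {}
    void setValue(long dwValue) { mdwValue = dwValue; }
    long getValue() const { return mdwValue; }
private:
    long mdwValue;
};

// These two helpers would be compiled into the DLL.
CMyExportedClass* createMyExportedClass() { return new CMyExportedClass; }
void deleteMyExportedClass(CMyExportedClass* pclass) { delete pclass; }

// Application side: a deleter that always releases through the DLL helper.
struct ExportedClassDeleter
{
    void operator()(CMyExportedClass* pclass) const { deleteMyExportedClass(pclass); }
};

typedef std::unique_ptr<CMyExportedClass, ExportedClassDeleter> ExportedClassPtr;

// Creates an instance through the DLL helper and ties its lifetime to the
// returned smart pointer.
ExportedClassPtr MakeExportedClass()
{
    return ExportedClassPtr(createMyExportedClass());
}
```

With this wrapper the application never writes new or delete for the exported class, so allocation and deletion stay paired on the DLL side.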

Exporting Class Member Functions


Even with the helper functions added, because the class itself is being exported from the DLL, it is still possible that users could create instances of the class without calling the createMyExportedClass helper function. This problem is easily solved by moving the export specification from the class level to the individual functions to which the users of the class need access. Then the application using the class can no longer create an instance of the class itself. Instead, it must call the createMyExportedClass helper function to create an instance of the class, and deleteMyExportedClass when it wishes to destroy the class.

class CMyExportedClass
{
public:
    CMyExportedClass(void) : mdwValue(0) { }
    DLLFUNCTION void setValue(long dwValue) { mdwValue = dwValue; }
    DLLFUNCTION long getValue(void) { return mdwValue; }
    long clearValue(void);
private:
    long mdwValue;
};

CMyExportedClass *createMyExportedClass(void) { return new CMyExportedClass; }
void deleteMyExportedClass(CMyExportedClass *pclass) { delete pclass; }


It should also be noted that although CMyExportedClass::clearValue is a public member function, it can no longer be called by users of the class outside the DLL, as it is not declared with dllexport. This can be a powerful tool for a complex class that needs to make some functions publicly accessible to users of the class outside the DLL, yet still needs to have other public functions for use inside the DLL itself. An example of this strategy in practice is the SDK for Discreet's 3D Studio MAX. Most of the classes have a mix of exported and nonexported functions. This allows the user of the SDK to access or derive functionality as needed from the exported member functions, while enabling the developers of the SDK to have their own set of internally available member functions.

Exporting Virtual Class Member Functions


One potential problem should be noted for users of Microsoft Visual C++ 6. If you are attempting to export the member functions of a class, and you are not linking with the lib file of the DLL that exports the class (you're using LoadLibrary to load the DLL at runtime), you will get an "unresolved external symbol" error for each function you reference if inline function expansion is disabled. This can happen regardless of whether the function is declared completely in the header. One fix for this is to change the inline function expansion setting to "Only __inline" or "Any Suitable." Unfortunately, this may conflict with your desire to actually have inline function expansion disabled in a debug build. An alternate fix is to declare the functions virtual. The virtual declaration will cause the correct code to be generated, regardless of the setting of the inline function expansion option. In many circumstances, you'll likely want to declare exported member functions virtual anyway, so that you can both work around the potential Visual C++ problem and allow the user to override member functions as necessary.
class CMyExportedClass
{
public:
    CMyExportedClass(void) : mdwValue(0) { }
    DLLFUNCTION virtual void setValue(long dwValue) { mdwValue = dwValue; }
    DLLFUNCTION virtual long getValue(void) { return mdwValue; }
    long clearValue(void);
private:
    long mdwValue;
};

With exported virtual member functions, deriving from the exported class on the application side is the same as if the exported class were declared completely in the application itself.

class CMyApplicationClass : public CMyExportedClass
{
public:
    CMyApplicationClass (void) { }
    virtual void setValue(long dwValue);
    virtual long getValue(void);
};

Summary
Exporting a class from a DLL is an easy and powerful way to share functionality without sharing source code. It can give the application all the benefits of a structured C++ class to use, derive from, or overload, while allowing the creator of the class to keep internal functions and variables safely hidden away.

1.5
Protect Yourself from DLL Hell and Missing OS Functions
Herb Marselas, Ensemble Studios
[email protected]

Dynamic Link Libraries (DLLs) are a powerful feature of Microsoft Windows. They have many uses, including sharing executable code and abstracting out device differences. Unfortunately, relying on DLLs can be problematic due to their standalone nature. If an application relies on a DLL that doesn't exist on the user's computer, attempting to run it will result in a "DLL Not Found" message that's not helpful to the average user. If the DLL does exist on the user's computer, there's no way to tell if the DLL is valid (at least as far as the application is concerned) if it's automatically loaded when the application starts up.

Bad DLL versions can easily find their way onto a system as the user installs and uninstalls other programs. Alternatively, there can even be differences in system DLLs among different Windows platforms and service packs. In these cases, the user may either get the cryptic "DynaLink Error!" message if the function being linked to in the DLL doesn't exist, or worse yet, the application will crash. All of these problems with finding and loading the correct DLL are often referred to as "DLL Hell." Fortunately, there are several ways to protect against falling into this particular hell.

Implicit vs. Explicit Linking


The first line of defense in protecting against bad DLLs is to make sure that the necessary DLLs exist on the user's computer and are a version with which the application can work. This must be done before attempting to use any of their functionality.

Normally, DLLs are linked to an application by specifying their eponymous lib file in the link line. This is known as implicit DLL loading, or implicit linking. By linking to the lib file, the operating system will automatically search for and load the matching DLL when a program runs. This method assumes that the DLL exists, that Windows can find it, and that it's a version with which the program can work.

Microsoft Visual C++ also supports three other methods of implicit linking. First, including a DLL's lib file directly into a project is just like adding it on the link line. Second, if a project includes a subproject that builds a DLL, the DLL's lib file is automatically linked with the project by default. Finally, a lib can be linked to an application using the #pragma comment(lib, "libname") directive.

The remedy to this situation of implicit linking and loading is to explicitly load the DLL. This is done by not linking to the DLL's lib file in the link line, and removing any #pragma comment directives that would link to a library. If a subproject in Visual C++ builds a DLL, the link property page of the subproject should be changed by checking the "Doesn't produce .LIB" option. By explicitly loading the DLL, the code can handle each error that could occur, making sure the DLL exists, making sure the functions required are present, and so forth.

LoadLibrary and GetProcAddress


When a DLL is implicitly loaded using a lib file, the functions can be called directly in the application's code, and the OS loader does all the work of loading DLLs and resolving function references. When switching to explicit linking, the functions must instead be called indirectly through a manually resolved function pointer. To do this, the DLL that contains the function must be explicitly loaded using the LoadLibrary function, and then we can retrieve a pointer to the function using GetProcAddress.
HMODULE LoadLibrary(LPCTSTR lpFileName);
FARPROC GetProcAddress(HMODULE hModule, LPCSTR lpProcName);
BOOL FreeLibrary(HMODULE hModule);

LoadLibrary searches for the specified DLL, loads it into the application's process space if it is found, and returns a handle to this new module. GetProcAddress is then used to create a function pointer to each function in the DLL that will be used by the game. When an explicitly loaded DLL is no longer needed, it should be freed using FreeLibrary. After calling FreeLibrary, the module handle is no longer considered valid.

Every LoadLibrary call must be matched with a FreeLibrary call. This is necessary because Windows increments a reference count on each DLL per process when it is loaded either implicitly by the executable or another DLL, or by calling LoadLibrary. This reference count is decremented by calling FreeLibrary, or unloading the executable or DLL that loaded this DLL. When the reference count for a given DLL reaches zero, Windows knows it can safely unload the DLL.
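One way to keep each LoadLibrary paired with exactly one FreeLibrary is to let a small object own the module handle. The following is a sketch rather than a finished utility: DllHandle, IsLoaded, and Find are hypothetical names, and the ANSI entry point LoadLibraryA is called directly so that the file name can be a plain char string. The block after #else is not part of the technique; it only supplies stand-ins that always fail, so that the sketch also compiles on non-Windows machines:

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins so the sketch also compiles off Windows; loading always fails there.
typedef void* HMODULE;
typedef void (*FARPROC)();
inline HMODULE LoadLibraryA(const char*) { return 0; }
inline FARPROC GetProcAddress(HMODULE, const char*) { return 0; }
inline int FreeLibrary(HMODULE) { return 1; }
#endif

// Owns one LoadLibrary reference and releases it with FreeLibrary.
class DllHandle
{
public:
    explicit DllHandle(const char* fileName) : mModule(LoadLibraryA(fileName)) {}

    // The matching FreeLibrary runs automatically when the object dies.
    ~DllHandle() { if (mModule) FreeLibrary(mModule); }

    bool IsLoaded() const { return mModule != 0; }

    // Looks up an exported function. Returns a null pointer if the DLL was
    // not loaded or does not export 'procName'. The caller supplies the
    // function pointer type and is responsible for it being correct.
    template <typename Proc>
    Proc Find(const char* procName) const
    {
        if (!mModule)
            return 0;
        return reinterpret_cast<Proc>(GetProcAddress(mModule, procName));
    }

private:
    DllHandle(const DllHandle&);             // non-copyable: one owner per
    DllHandle& operator=(const DllHandle&);  // LoadLibrary reference
    HMODULE mModule;
};
```

Function pointers obtained from Find are only valid while the DllHandle that produced them is alive, so the handle should outlive every call made through them.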

Guarding Against DirectX


One of the problems we have often found is that the required versions of DirectX components are not installed, or the install is corrupt in some way. To protect our game against these problems, we explicitly load the DirectX components we need. If we were to implicitly link to DirectInput in DirectX 8, we would have added the dinput8.lib to our link line and used the following code:

IDirectInput8 *pDInput;
HRESULT hr = DirectInput8Create(hInstance, DIRECTINPUT_VERSION,
    IID_IDirectInput8, (LPVOID*) &pDInput, 0);
if (FAILED(hr))
{
    // handle error - initialization error
}

The explicit DLL loading case effectively adds two more lines of code, but the application is now protected against dinput8.dll not being found, or of it being corrupt in some way.
    typedef HRESULT (WINAPI* DirectInput8Create_PROC)(HINSTANCE hinst, DWORD dwVersion, REFIID riidltf, LPVOID* ppvOut, LPUNKNOWN punkOuter);

    HMODULE hDInputLib = LoadLibrary("dinput8.dll");
    if (!hDInputLib)
    {
        // handle error - DInput 8 not found. Is it installed incorrectly
        // or at all?
    }

    DirectInput8Create_PROC diCreate;
    diCreate = (DirectInput8Create_PROC) GetProcAddress(hDInputLib, "DirectInput8Create");
    if (!diCreate)
    {
        // handle error - DInput 8 exists, but the function can't be
        // found.
    }

    HRESULT hr = (diCreate)(hInstance, DIRECTINPUT_VERSION, IID_IDirectInput8, (LPVOID*) &mDirectInput, NULL);
    if (FAILED(hr))
    {
        // handle error - initialization error
    }

First, a function pointer typedef is created that reflects the function DirectInput8Create. The DLL is then loaded using LoadLibrary. If dinput8.dll was loaded successfully, we then attempt to find the function DirectInput8Create using GetProcAddress. GetProcAddress returns a pointer to the function if it is found, or NULL if the function cannot be found. We then check to make sure the function pointer is valid. Finally, we call DirectInput8Create through the function pointer to initialize DirectInput.


If there were more functions that needed to be retrieved from the DLL, a function pointer typedef and variable would be declared for each. It might be sufficient to only check for NULL when mapping the first function pointer using GetProcAddress. However, a DLL can export one function and still lack another, so checking every GetProcAddress call for a non-NULL return is the safer choice.
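When several entry points are needed, the per-function checks can be gathered into a small table-driven helper. The following is a minimal sketch of that idea, not code from the gem: ImportEntry, ResolveImports, and the resolver callback are hypothetical names, and the callback stands in for a call such as GetProcAddress so that the sketch stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Generic function pointer type; plays the role FARPROC plays on Windows.
typedef void (*GenericProc)();

// Hypothetical resolver callback. In a real build this would wrap
// GetProcAddress for an already loaded module.
typedef GenericProc (*ResolveFn)(const char* name);

// One row per function we want to map.
struct ImportEntry
{
    const char*  name;    // exported name to look up
    GenericProc* target;  // variable that receives the resolved pointer
};

// Resolves every entry in order. Returns nullptr on success, or the name
// of the first function that could not be found.
const char* ResolveImports(ResolveFn resolve, ImportEntry* entries, std::size_t count)
{
    assert(resolve != nullptr);
    for (std::size_t i = 0; i < count; ++i)
    {
        *entries[i].target = resolve(entries[i].name);
        if (*entries[i].target == nullptr)
            return entries[i].name;
    }
    return nullptr;
}

// Stand-in "DLL" for illustration only: it exports a single function, "Create".
void FakeCreate() {}

GenericProc FakeResolver(const char* name)
{
    return std::strcmp(name, "Create") == 0 ? &FakeCreate : nullptr;
}
```

Returning the name of the missing function lets the caller report exactly which export was absent, which is usually more helpful than a bare failure flag.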

Using OS-Specific Features

Another issue that explicit DLL loading can resolve is when an application wants to take advantage of a specific API function if it is available. There are a large number of extended functions ending in "Ex" that are supported under Windows NT or 2000, and not available in Windows 95 or 98. These extended functions usually provide more information or additional functionality than the original functions do. An example of this is the CopyFileEx function, which provides the ability to cancel a long file copy operation. Instead of calling it directly, kernel32.dll can be loaded using LoadLibrary and the function again mapped with GetProcAddress. If we load kernel32.dll and find CopyFileEx, we use it. If we don't find it, we can use the regular CopyFile function. One other problem that must be avoided in this case is that CopyFileEx is really only a #define in the winbase.h header file that is replaced with CopyFileExA or CopyFileExW if compiling for ASCII or wide Unicode characters, respectively. GetProcAddress looks up the name actually exported by the DLL, so it must be given "CopyFileExA" or "CopyFileExW" rather than "CopyFileEx".
    typedef BOOL (WINAPI *CopyFileEx_PROC)(LPCTSTR lpExistingFileName, LPCTSTR lpNewFileName, LPPROGRESS_ROUTINE lpProgressRoutine, LPVOID lpData, LPBOOL pbCancel, DWORD dwCopyFlags);

    HMODULE hKernel32 = LoadLibrary("kernel32.dll");
    if (!hKernel32)
    {
        // handle error - kernel32.dll not found. Wow! That's really bad
    }

    CopyFileEx_PROC pfnCopyFileEx;
    pfnCopyFileEx = (CopyFileEx_PROC) GetProcAddress(hKernel32, "CopyFileExA");
    BOOL bReturn;
    if (pfnCopyFileEx)
    {
        // use CopyFileEx to copy the file
        bReturn = (pfnCopyFileEx)(pExistingFile, pDestinationFile, ...);
    }
    else
    {
        // use the regular CopyFile function
        bReturn = CopyFile(pExistingFile, pDestinationFile, FALSE);
    }


The use of LoadLibrary and GetProcAddress can also be applied to game DLLs. One example of this is the graphics support in a game engine currently under development at Ensemble Studios, where graphics support for Direct3D and OpenGL has been broken out into separate DLLs that are explicitly loaded as necessary. If Direct3D graphics support is needed, the Direct3D support DLL is loaded with LoadLibrary and the exported functions are mapped using GetProcAddress. This setup keeps the main executable free from having to link implicitly with either d3d8.lib or opengl32.lib. However, the supporting Direct3D DLL links implicitly with d3d8.lib, and the supporting OpenGL DLL links implicitly with opengl32.lib.

This explicit loading of the game's own DLLs by the main executable, and implicit loading by each graphics subsystem, solves several problems. First, if an attempt to load either library fails, it's likely that that particular graphics subsystem's files cannot be found or are corrupt. The main program can then handle the error gracefully. The other problem that this solves, which is more of an issue with OpenGL than Direct3D, is that if the engine were to link explicitly to OpenGL, it would need a typedef and function pointer for every OpenGL function it used. The implicit linking inside the support DLL solves this problem.
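One way to picture this arrangement is a single factory function per support DLL, so the main executable maps one export instead of one per graphics call. The sketch below is an assumption about how such a layer could look, not the Ensemble Studios code: IRenderer, CreateRendererFn, CreateFirstAvailable, and SoftwareRenderer are hypothetical, and a null factory pointer stands in for a DLL that could not be loaded or whose export was not found.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Interface the main executable programs against. Each support DLL would
// provide its own implementation and link implicitly to its graphics API.
class IRenderer
{
public:
    virtual ~IRenderer() {}
    virtual std::string Name() const = 0;
};

// Signature of the one factory function each support DLL is assumed to export.
typedef IRenderer* (*CreateRendererFn)();

// Tries each factory in order of preference. A null entry models a support
// DLL that failed to load, so it is skipped rather than treated as fatal.
IRenderer* CreateFirstAvailable(const CreateRendererFn* factories, std::size_t count)
{
    assert(factories != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i)
    {
        if (factories[i] == nullptr)
            continue;
        IRenderer* renderer = factories[i]();
        if (renderer != nullptr)
            return renderer;
    }
    return nullptr;
}

// Stand-in implementation so the sketch can run without any graphics API.
class SoftwareRenderer : public IRenderer
{
public:
    virtual std::string Name() const { return "software"; }
};

IRenderer* CreateSoftwareRenderer()
{
    return new SoftwareRenderer();
}
```

The caller owns the returned renderer and deletes it when done; in a real engine the support DLL would typically also export a matching destroy function so that allocation and release happen in the same module.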

Summary
Explicit linking can act as a barrier against a number of common DLL problems that are encountered under Windows, including missing DLLs, or versions of DLLs that aren't compatible with an application. While not a panacea, it can at least put the application in control and allow any error to be handled gracefully instead of with a cryptic error message or an outright crash.

1.6
Dynamic Type Information
Scott Wakeling, Virgin Interactive
[email protected]

As developers continue to embrace object orientation, the systems that power games are growing increasingly flexible, and inherently more complex. Such systems now regularly contain many different types and classes; counts of over 1000 are not unheard of. Coping with so many different types in a game engine can be a challenge in itself. A type can really mean anything from a class, to a struct, to a standard data type. This gem discusses managing types effectively by providing ways of querying their relations to other types, or accessing information about their type at runtime for query or debug purposes. Toward the end of the gem, an approach for supporting persistent objects is suggested with some ideas about how the method can be extended.

Introducing the Dynamic Type Information Class


In our efforts to harness the power of our types effectively, we'll be turning to the aid of one class in particular: the dynamic type information (DTI) class. This class will store any information that we may need to know about the type of any given object or structure. A minimal implementation of the class is given here:

    class dtiClass
    {
    private:
        char* szName;
        dtiClass* pdtiParent;

    public:
        dtiClass();
        dtiClass( char* szSetName, dtiClass* pSetParent );
        virtual ~dtiClass();

        const char* GetName();
        bool SetName( char* szSetName );
        dtiClass* GetParent();
        bool SetParent( dtiClass* pSetParent );
    };


In order to instill DTI into our engine, all our classes will need a dtiClass as a static member. It's this class that allows us to access a class name for debug purposes and query the dtiClass member of the class's parent. This member must permeate the class tree all the way from the root class down, thus ensuring that all game objects have access to information about themselves and their parents. The implementation of dtiClass can be found in the code on the accompanying CD.

Exposing and Querying the DTI


Let's see how we can begin to use DTI by implementing a very simple class tree as described previously. Here is a code snippet showing a macro that helps us define our static dtiClass member, a basic root class, and simple initialization of the class's type info:

    #define EXPOSE_TYPE \
        public: \
            static dtiClass Type;

    class CRootClass
    {
    public:
        EXPOSE_TYPE;

        CRootClass() {};
        virtual ~CRootClass() {};
    };

    dtiClass CRootClass::Type( "CRootClass", NULL );

By including the EXPOSE_TYPE macro in all of our class definitions and initializing the static Type member correctly as shown, we've taken the first step toward instilling dynamic type info in our game engine. We pass our class name and a pointer to the class's parent's dtiClass member. The dtiClass constructor does the rest, setting up the szName and pdtiParent members accordingly. We can now query for an object's class name at runtime for debug purposes or other type-related cases, such as saving or loading a game. More on that later, but for now, here's a quick line of code that will get us our class name:

    // Let's see what kind of object this pointer is pointing to
    const char* szGetName = pSomePtr->Type.GetName();

In the original example, we passed NULL in to the dtiClass constructor as the class's parent field because this is our root class. For classes that derive from others, we just need to pass the address of the parent class's Type member. For example, if we were to specify a child class of our root, a basic definition might look something like this:

    class CChildClass : public CRootClass
    {
        EXPOSE_TYPE;

        // Constructor and virtual Destructor go here
    };

    dtiClass CChildClass::Type( "CChildClass", &CRootClass::Type );

Now we have something of a class tree growing. We can access not only our class's name, but the name of its parent too, as long as its type has been exposed with the EXPOSE_TYPE macro. Here's a line of code that would get us our parent's name:
    // Let's see what kind of class this object is derived from
    const char* szParentName = pSomePtr->Type.GetParent()->GetName();

Now that we have a simple class tree with DTI present and know how to use that information to query for class and parent names at runtime, we can move on to implementing a useful method for safeguarding type casts, or simply querying an object about its roots or general type.
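One subtlety is worth illustrating. Because Type is a static member, an expression such as pSomePtr->Type is resolved against the declared type of the pointer, not the object it actually points to. The sketch below shows one common way around this, a virtual accessor added by the macro. This is an assumption for illustration rather than the code on the CD: EXPOSE_TYPE_VIRTUAL and GetType are hypothetical names, and dtiClass is reduced to the members the example needs.

```cpp
#include <cassert>
#include <cstring>

// Reduced stand-in for the gem's dtiClass: a name and a parent pointer.
class dtiClass
{
public:
    dtiClass(const char* szName, const dtiClass* pParent)
        : m_szName(szName), m_pParent(pParent)
    {
        assert(szName != nullptr);
    }

    const char* GetName() const { return m_szName; }
    const dtiClass* GetParent() const { return m_pParent; }

private:
    const char*     m_szName;
    const dtiClass* m_pParent;
};

// Hypothetical variant of EXPOSE_TYPE that also adds a virtual accessor,
// so the most derived class's type info is reachable through a base pointer.
#define EXPOSE_TYPE_VIRTUAL \
    public: \
        static dtiClass Type; \
        virtual const dtiClass& GetType() const { return Type; }

class CRootClass
{
    EXPOSE_TYPE_VIRTUAL
public:
    virtual ~CRootClass() {}
};
dtiClass CRootClass::Type("CRootClass", nullptr);

class CChildClass : public CRootClass
{
    EXPOSE_TYPE_VIRTUAL
};
dtiClass CChildClass::Type("CChildClass", &CRootClass::Type);
```

With a CRootClass pointer aimed at a CChildClass object, the static member reports the pointer's declared class while the virtual accessor reports the object's real class, which is what runtime queries generally want.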

Inheritance Means "IsA"


Object orientation gave us the power of inheritance. With inheritance came polymorphism, the ability for all our objects to be just one of many types at any one time. In many cases, polymorphism is put to use in game programming to handle many types of objects in a safe, dynamic, and effective manner. This means we like to ensure that objects are of compatible types before we cast them, thus preventing undefined behavior. It also means we like to be able to check what type an object conforms to at runtime, rather than having to know at compile time, and we like to be able to do all of these things quickly and easily.

Imagine that our game involves a number of different types of robots, some purely electronic, and some with mechanical parts, maybe fuel driven. Now assume for instance that there is a certain type of weapon the player may have that is very effective against the purely electronic robots, but less so against their mechanical counterparts. The classes that define these robots are very likely to be of the same basic type, meaning they probably both inherit from the same generic robot base class, and then go on to override certain functionality or add fresh attributes.

To cope with varying types of specialist child classes, we need to query their roots. We can extend the dtiClass introduced earlier to provide us with such a routine. We'll call the new member function IsA, because inheritance can be seen to translate to "is a type of." Here's the function:
    bool dtiClass::IsA( dtiClass* pType )
    {
        dtiClass* pStartType = this;
        while( pStartType )
        {
            if ( pStartType == pType )
                return true;
            else
                pStartType = pStartType->GetParent();
        }
        return false;
    }

If we need to know whether a certain robot subclass is derived from a certain root class, we just need to call IsA from the object's own dtiClass member, passing in the static dtiClass member of the root class. Here's a quick example:
    CRootClass* pRoot = NULL;
    CChildClass* pChild = new CChildClass();
    if ( pChild->Type.IsA( &CRootClass::Type ) )
        pRoot = (CRootClass*)pChild;

We can see that the result of a quick IsA check tells us whether we are derived, directly or indirectly, from a given base class. Of course, we might use this fact to go on and perform a safe casting operation, as in the preceding example. Or, maybe we'll just use the check to filter out certain types of game objects in a given area, given that their type makes them susceptible to a certain weapon or effect. If we decide that a safe casting operation is something we'll need regularly, we can add the following function to the root object to simplify matters. Here's the definition and a quick example; the function's implementation is on the accompanying CD:
    // SafeCast member function definition added to CRootClass
    void* SafeCast( dtiClass* pCastToType );

    // How to simplify the above operation
    pRoot = (CRootClass*)pChild->SafeCast( &CRootClass::Type );

If the cast is not safe (in other words, the types are not related), then SafeCast returns NULL, and pRoot will be NULL.
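Since the SafeCast implementation lives on the CD, here is one hedged guess at how it could be written. It assumes single inheritance (so the object's address is the same for every class in the chain) and a virtual GetType accessor, which is a hypothetical addition not shown in the gem; CSiblingClass is likewise invented to give the example an unrelated type.

```cpp
#include <cassert>
#include <cstddef>

// Reduced stand-in for the gem's dtiClass with the IsA walk up the parents.
class dtiClass
{
public:
    dtiClass(const char* szName, dtiClass* pParent)
        : m_szName(szName), m_pParent(pParent) {}

    const char* GetName() const { return m_szName; }

    bool IsA(const dtiClass* pType) const
    {
        for (const dtiClass* p = this; p != nullptr; p = p->m_pParent)
            if (p == pType)
                return true;
        return false;
    }

private:
    const char* m_szName;
    dtiClass*   m_pParent;
};

class CRootClass
{
public:
    static dtiClass Type;
    virtual ~CRootClass() {}

    // Hypothetical accessor: reports the type info of the real object.
    virtual dtiClass* GetType() const { return &Type; }

    // Returns this object if it is, or derives from, the requested type;
    // otherwise returns nullptr. Valid only for single inheritance.
    void* SafeCast(dtiClass* pCastToType)
    {
        assert(pCastToType != nullptr);
        return GetType()->IsA(pCastToType) ? this : nullptr;
    }
};
dtiClass CRootClass::Type("CRootClass", nullptr);

class CChildClass : public CRootClass
{
public:
    static dtiClass Type;
    virtual dtiClass* GetType() const { return &Type; }
};
dtiClass CChildClass::Type("CChildClass", &CRootClass::Type);

class CSiblingClass : public CRootClass
{
public:
    static dtiClass Type;
    virtual dtiClass* GetType() const { return &Type; }
};
dtiClass CSiblingClass::Type("CSiblingClass", &CRootClass::Type);
```

Because the check uses the object's own type info, the same call works for casting up the tree and for testing whether a base pointer really refers to a given subclass.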

Handling Generic Objects


Going back to our simple game example, let's consider how we might cope with so many different types of robot effectively. The answer starts off quite simple: we can make use of polymorphism and just store pointers to them all in one big array of generic base class pointers. Even our more specialized robots can be stored here, such as CRobotMech (derived from CRobot), because polymorphism dictates that for any type requirement, a derived type can always be supplied instead. Now we have our vast array of game objects, all stored as pointers to a given base class. We can iterate over them safely, perhaps calling virtual functions on each and getting the more specialized (overridden) routines carried out by default.

As part of our runtime type info solution, we have the IsA and SafeCast routines that can query what general type an object is, and cast it safely up the class tree. This is often referred to as up-casting, and it takes us halfway to handling vast numbers of game objects in a fast, safe, and generic way. The other half of the problem comes with down-casting: casting a pointer to a generic base class safely down to a more specialized subclass. If we want to iterate a list of root class pointers, and check whether each really points to a specific type of subclass, we need to make use of the dynamic casting operator, introduced by C++. The dynamic casting operator is used to convert among polymorphic types and is both safe and informative: it reports whether the attempted cast succeeded. Here's the form it takes:

    dynamic_cast< type-id >(expression)

The first parameter we must pass in is the type we wish expression to conform to after the cast has taken place. This can be a pointer or reference to one of our classes. If it's a pointer, the parameter we pass in as expression must be a pointer, too. If we pass a reference to a class, we must pass an l-value in the second parameter. Here are two examples:

    // Given a root object (RootObj) and a pointer (pRoot), we
    // can down-cast like this
    CChildClass* pChild = dynamic_cast<CChildClass*>(pRoot);
    CChildClass& ChildObj = dynamic_cast<CChildClass&>(RootObj);

To gain access to these extended casting operators, we need to enable embedded runtime type information in the compiler settings (use the /GR switch for Microsoft Visual C++).
If the requested cast cannot be made (for example, if the root pointer does not really point to anything more derived), the operator will simply fail and the expression will evaluate to NULL. Therefore, from the preceding code snippet, pChild would evaluate to NULL if pRoot really did only point to a CRootClass object. If the cast of RootObj failed, an exception would be thrown, which could be contained with a try/catch block (an example is included on the companion CD-ROM).

The dynamic_cast operator lets us determine what type is really hidden behind a pointer. Imagine we want to iterate through every robot in a certain radius and determine which ones are mechanical models, and thus immune to the effects of a certain weapon. Given a list of generic CRobot pointers, we could iterate through these and perform dynamic casts on each, checking which ones are successful and which resolve to NULL, and thus establishing which ones were in fact mechanical. Finally, we can now safely down-cast too, which completes our runtime type information solution. The code on the companion CD-ROM has a more extended example of using the dynamic casting operator.
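The robot filtering scenario can be sketched in a few lines. This is an illustrative example rather than the CD code: CRobotElectronic, CountMechanical, and IsMechanical are hypothetical names, and the classes are empty apart from the virtual destructor that makes them polymorphic. It requires RTTI and exceptions to be enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <typeinfo>
#include <vector>

// Generic robot base class; the virtual destructor makes it polymorphic,
// which dynamic_cast requires.
class CRobot
{
public:
    virtual ~CRobot() {}
};

class CRobotMech : public CRobot {};
class CRobotElectronic : public CRobot {};  // hypothetical sibling type

// Pointer form: a failed cast yields a null pointer.
std::size_t CountMechanical(const std::vector<CRobot*>& robots)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < robots.size(); ++i)
    {
        assert(robots[i] != nullptr);
        if (dynamic_cast<CRobotMech*>(robots[i]) != nullptr)
            ++count;
    }
    return count;
}

// Reference form: a failed cast throws std::bad_cast instead.
bool IsMechanical(CRobot& robot)
{
    try
    {
        CRobotMech& mech = dynamic_cast<CRobotMech&>(robot);
        (void)mech;  // only the success of the cast matters here
        return true;
    }
    catch (const std::bad_cast&)
    {
        return false;
    }
}
```

The pointer form is usually the better fit for filtering loops like this one, since a failed cast is an expected outcome rather than an error.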

Implementing Persistent Type Information


Now that our objects no longer have an identity crisis and we're managing them effectively at runtime, we can move on to consider implementing a persistent object solution, thus extending our type-related capabilities and allowing us to handle things such as game saves or object repositories with ease. The first thing we need is a bare-bones implementation of a binary store where we can keep our object data. An example implementation, CdtiBin, can be found on the companion CD-ROM. There are a number of utility member functions, but the two important points are the Stream member function, and the friend operators that allow us to write out or load the basic data types of the language. We'll need to add an operator for each basic type we want to persist. When Stream is called, the data will be either read from the file or written, depending on the values of m_bLoading and m_bSaving. To let our classes know how to work with the object repositories we need to add the Serialize function, shown here:
virtual void Serialize( CdtiBin& ObjStore );

Note that it is virtual and needs to be overridden for all child classes that have additional data over their parents. If we add a simple integer member to CRootClass, we would write the Serialize function like this:
    void CRootClass::Serialize( CdtiBin& ObjStore )
    {
        ObjStore << iMemberInt;
    }

We would have to be sure to provide the friend operator for integers and CdtiBin objects. We could write object settings out to a file, and later load them back in and repopulate fresh objects with the old data, thus ensuring a persistent object solution for use in a game save routine. All types would thus know how to save themselves, making our game save routines much easier to implement. However, child classes need to write out their data and that of their parents. Instead of forcing the programmer to look up all data passed down from parents and adding it to each class's Serialize member, we need to give each class access to its parent's Serialize routine. This allows child classes to write (or load) their inherited data before their own data. We use the DECLARE_SUPER macro for this:
    #define DECLARE_SUPER(SuperClass) \
        public: \
            typedef SuperClass Super;


    class CChildClass : public CRootClass
    {
        DECLARE_SUPER(CRootClass);
        // remainder of the class definition
    };

This further extends our type solution by allowing our classes to call their immediate parents' versions of functions, making our class trees more extensible. CRootClass doesn't need to declare its superclass because it doesn't have one, and thus its Serialize member only needs to cope with its own data. Here's how CChildClass::Serialize calls CRootClass::Serialize before dealing with some of its own data (added specifically for the example):
    void CChildClass::Serialize( CdtiBin& ObjStore )
    {
        Super::Serialize( ObjStore );
        ObjStore << fMemberFloat << iAnotherInt;
    }

A friend operator for the float data type was added to support the above. Note that the order in which attributes are saved and loaded is always the same. Code showing how to create a binary store, write a couple of objects out, and then repopulate the objects' attributes can be found on the companion CD-ROM. As long as object types are serialized in the same order both ways, their attributes will remain persistent between saves and loads. Adding the correct friend operators to the CdtiBin class adds support for basic data types. If we want to add user-defined structures to our class members, we just need to write an operator for coping with that struct. With this in place, all objects and types in the engine will know precisely how to save themselves out to a binary store and read themselves back in.
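To make the save and load round trip concrete, here is a small sketch of the whole mechanism. It is an assumption-laden stand-in for the CD's CdtiBin: it stores bytes in memory rather than in a file, uses a single m_bLoading flag, and adds a hypothetical BeginLoad function to switch direction. Only the Stream function, the friend operators, Serialize, and the Super typedef correspond to things named in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// In-memory stand-in for the binary store. One operator serves both
// directions: it writes while saving and reads while loading.
class CdtiBin
{
public:
    CdtiBin() : m_bLoading(false), m_pos(0) {}

    // Hypothetical helper: rewind and switch from saving to loading.
    void BeginLoad() { m_bLoading = true; m_pos = 0; }

    void Stream(void* pData, std::size_t size)
    {
        char* bytes = static_cast<char*>(pData);
        if (m_bLoading)
        {
            assert(m_pos + size <= m_buffer.size());
            std::memcpy(bytes, m_buffer.data() + m_pos, size);
            m_pos += size;
        }
        else
        {
            m_buffer.insert(m_buffer.end(), bytes, bytes + size);
        }
    }

    friend CdtiBin& operator<<(CdtiBin& bin, int& value)
    {
        bin.Stream(&value, sizeof(value));
        return bin;
    }

    friend CdtiBin& operator<<(CdtiBin& bin, float& value)
    {
        bin.Stream(&value, sizeof(value));
        return bin;
    }

private:
    bool              m_bLoading;
    std::size_t       m_pos;
    std::vector<char> m_buffer;
};

class CRootClass
{
public:
    CRootClass() : iMemberInt(0) {}
    virtual ~CRootClass() {}
    virtual void Serialize(CdtiBin& ObjStore) { ObjStore << iMemberInt; }

    int iMemberInt;
};

class CChildClass : public CRootClass
{
public:
    typedef CRootClass Super;  // what DECLARE_SUPER(CRootClass) expands to

    CChildClass() : fMemberFloat(0.0f), iAnotherInt(0) {}

    virtual void Serialize(CdtiBin& ObjStore)
    {
        Super::Serialize(ObjStore);               // inherited data first
        ObjStore << fMemberFloat << iAnotherInt;  // then our own, same order both ways
    }

    float fMemberFloat;
    int   iAnotherInt;
};
```

Writing raw bytes like this ties the saved data to one platform's type sizes and byte order, which is acceptable for a sketch but is something a shipping save format would need to address.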

Applying Persistent Type Information to a Game Save Database


As mentioned previously, objects need to be serialized out and loaded back in the same order. The quickest and easiest method is to only save out one object to the game saves, and then just load that one back in. If we can define any point in the game by constructing some kind of game state object that knows precisely how to serialize itself either way, then we can write all our game data out in one hit, and read it back in at any point. Our game state object would no doubt contain arrays of objects. As long as the custom array type knows how to serialize itself, and we have all the correct CdtiBin operators written for our types, everything will work. Saving and loading a game will be a simple matter of managing the game from a high-level, all-encompassing containment class, calling just the one Serialize routine when needed.


Conclusion
There is still more that could be done than just the solution described here. Supporting multiple inheritance wouldn't be difficult. Instead of storing just the one parent pointer in our static dtiClass, we would store an array with as many parents as a class has, specifying the count and a variable number of type classes in a suitable macro, or by extending the dtiClass constructor. An object flagging system would also be useful, and would allow us to enforce special cases such as abstract base classes or objects we only ever wanted to be contained in other classes, and never by themselves ("contained classes").
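As a rough illustration of the multiple inheritance idea, the sketch below stores a list of parents and makes IsA search every branch. It is only one possible reading of the suggestion: the two-parent constructor is a simplification chosen to keep the example short, and the type names used with it are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant of dtiClass that records several parents instead of one.
class dtiClass
{
public:
    // Simplified for the sketch: up to two parents may be supplied.
    explicit dtiClass(const char* szName,
                      const dtiClass* pParentA = nullptr,
                      const dtiClass* pParentB = nullptr)
        : m_szName(szName)
    {
        assert(szName != nullptr);
        if (pParentA != nullptr) m_parents.push_back(pParentA);
        if (pParentB != nullptr) m_parents.push_back(pParentB);
    }

    const char* GetName() const { return m_szName; }

    // True if this type is pType or derives from it along any branch.
    bool IsA(const dtiClass* pType) const
    {
        if (this == pType)
            return true;
        for (std::size_t i = 0; i < m_parents.size(); ++i)
            if (m_parents[i]->IsA(pType))
                return true;
        return false;
    }

private:
    const char*                  m_szName;
    std::vector<const dtiClass*> m_parents;
};
```

Note that a SafeCast built on top of this could no longer return the object's address unchanged, because with multiple inheritance each base subobject may live at a different offset.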

References
[Meyers98] Meyers, Scott D., Effective C++, 2nd Edition, Addison-Wesley, 1998.
[Wilkie94] Wilkie, George, Object-Oriented Software Engineering, Addison-Wesley, 1994.
[Eberly00] Eberly, David H., 3D Game Engine Design, Morgan Kaufmann, 1999-2000.
[Wakeling01] Wakeling, Scott J., "Coping with Class Trees," available online at www.chronicreality.com/articles, March 12, 2001.

1.7
A Property Class for Generic C++ Member Access
Charles Cafrelli
[email protected]

Practically every game has a unique set of game objects, and any code that has to manipulate those objects has to be written from scratch for each project. Take, for example, an in-game editor, which has a simple purpose: to create, place, display, and edit object properties. Object creation is almost always specific to the game, or can be handled by a class factory. Object placement is specific to the visualization engine, which makes reuse difficult, assuming it is even possible to visually place an object on the map. In some cases, a generic map editor that can be toggled on and off (or possibly superimposed as a heads-up display) can be reused from game to game. Therefore, in theory, it should be possible to develop a core editor module that can be reused without having to rewrite the same code over and over again for each project. However, given that all games have unique objects, how does the editor know what to display for editing purposes without rewriting the editor code?

What we need is a general object interface that allows access to the internals of a class. Borland's C++ Builder provides an excellent C++ declaration type called __property that does this very thing, but alas, it is a proprietary extension and unusable outside of Borland C++. Interestingly enough, C#, Microsoft's new programming language developed by the creator of Borland C++ Builder, contains the same feature. Microsoft's COM interface allows runtime querying of an object for its members, but it requires that we bind our objects to the COM interface, making them less portable than straight C++. This leaves a "roll-your-own" solution, which can be more lightweight than COM, and more portable than proprietary extensions to the C++ language. This will allow code modules such as the in-game editor to be written just once, and used across many engines.

The Code
The interface is broken into two classes: a Property class and a PropertySet class. Property is a container for one piece of data. It contains a union of pointers to different data types, an enumeration for the type of data, and a string for the property name. The full source code can be found on the companion CD.

    class Property
    {
    protected:
        union Data
        {
            int* m_int;
            float* m_float;
            std::string* m_string;
            bool* m_bool;
        };

        enum Type
        {
            INT,
            FLOAT,
            STRING,
            BOOL,
            EMPTY
        };

        Data m_data;
        Type m_type;
        std::string m_name;

    protected:
        void EraseType();
        void Register(int* value);
        void Register(float* value);
        void Register(std::string* new_string);
        void Register(bool* value);

    public:
        Property();
        Property(std::string const& name);
        Property(std::string const& name, int* value);
        Property(std::string const& name, float* value);
        Property(std::string const& name, std::string* value);
        Property(std::string const& name, bool* value);
        ~Property();

        bool SetUnknownValue(std::string const& value);
        bool Set(int value);
        bool Set(float value);
        bool Set(std::string const& value);
        bool Set(bool value);

        void SetName(std::string const& name);
        std::string GetName() const;

        int GetInt();
        float GetFloat();
        std::string GetString();
        bool GetBool();
    };


The example code shows basic data types being used and stored, although these could be easily expanded to handle any data type. Properties store only a pointer back to the original data. Properties do not actually declare their own objects, or allocate their own memory, so manipulating a property's data results in the original data's memory being handled. Setting a value via the Set function automatically defines the type of the property. Properties are constructed and manipulated through a PropertySet class. The PropertySet class contains the list of registered properties, the registration methods, and the lookup method.

    class PropertySet
    {
    protected:
        HashTable<Property> m_properties;

    public:
        PropertySet();
        virtual ~PropertySet();

        void Register(std::string const& name, int* value);
        void Register(std::string const& name, float* value);
        void Register(std::string const& name, std::string* value);
        void Register(std::string const& name, bool* value);

        // look up a property
        Property* Lookup(std::string const& name);

        // get a list of available properties
        bool SetValue(std::string const& name, std::string* value);
        bool Set(std::string const& name, std::string const& value);
        bool Set(std::string const& name, int value);
        bool Set(std::string const& name, float value);
        bool Set(std::string const& name, bool value);
        bool Set(std::string const& name, char* value);
    };
The PropertySet is organized around a HashTable object that organizes all of the stored properties using a standard hash table algorithm. The HashTable itself is a template that can be used to hash into different objects, and is included on the companion CD.

We derive the game object from the PropertySet class:

    class GameObject : public PropertySet
    {
        int m_test;
    };
Any properties or flags that need to be publicly exposed or used by other objects should be registered, usually at construction time. For example:
Register("test_value",&m_test);


Calling objects can use the Lookup method to access the registered data.

    void Update(PropertySet& property_set)
    {
        Property* test_value_property = property_set.Lookup("test_value");
        int test_value = test_value_property->GetInt();
        // etc
    }
As all of the game objects are now of type PropertySet, and as all objects are usually stored in a master update list, it is a simple matter of handing the list pointer off to the in-game editor for processing. New derived object types simply have to register their additional properties to be handled by the editor. No additional coding is necessary because the editor is not concerned with the derived types. It is sometimes helpful to specify the type in a property name (such as "Type") to assist the user when visually editing the object. It's also useful to make the property required, so that the editor could, for example, parse the property list into a "tree" style display. This process also provides the additional benefit of decoupling the data from its name. For instance, internally, the data may be referred to as m_colour, but can be exposed as "color."
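The pieces above can be put together in a compact, runnable form. The sketch below is a cut-down illustration, not the CD source: it handles only int and float, replaces the HashTable template with std::map, adds a hypothetical GetType accessor, and has Set reject a value of the wrong type rather than redefining the property's type. It also checks the result of Lookup, since a misspelled name yields a null pointer.

```cpp
#include <cassert>
#include <map>
#include <string>

// Cut-down Property: a typed pointer back to data owned by someone else.
class Property
{
public:
    enum Type { INT, FLOAT, EMPTY };

    Property() : m_type(EMPTY) { m_data.m_int = nullptr; }
    explicit Property(int* value) : m_type(INT) { m_data.m_int = value; }
    explicit Property(float* value) : m_type(FLOAT) { m_data.m_float = value; }

    Type GetType() const { return m_type; }

    int GetInt() const { assert(m_type == INT); return *m_data.m_int; }
    float GetFloat() const { assert(m_type == FLOAT); return *m_data.m_float; }

    // In this sketch, Set fails instead of changing the property's type.
    bool Set(int value)
    {
        if (m_type != INT) return false;
        *m_data.m_int = value;
        return true;
    }

    bool Set(float value)
    {
        if (m_type != FLOAT) return false;
        *m_data.m_float = value;
        return true;
    }

private:
    union Data { int* m_int; float* m_float; };
    Data m_data;
    Type m_type;
};

class PropertySet
{
public:
    virtual ~PropertySet() {}

    void Register(const std::string& name, int* value) { m_properties[name] = Property(value); }
    void Register(const std::string& name, float* value) { m_properties[name] = Property(value); }

    // Returns nullptr when no property with that name was registered.
    Property* Lookup(const std::string& name)
    {
        std::map<std::string, Property>::iterator it = m_properties.find(name);
        return it == m_properties.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Property> m_properties;  // stand-in for HashTable
};

class GameObject : public PropertySet
{
public:
    GameObject() : m_test(0), m_speed(0.0f)
    {
        Register("test_value", &m_test);
        Register("speed", &m_speed);
    }

    int GetTest() const { return m_test; }

private:
    int   m_test;
    float m_speed;
};
```

Because each Property holds a raw pointer into its owner, copying a GameObject with the compiler-generated copy operations would leave the copy's properties pointing at the original's members; a real implementation would need to re-register on copy or forbid copying.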

Additional Uses
These classes were designed around a concentric ring design theory. The PropertySet cannot be used without the Property class. However, the Property class can be used on its own, or with another set type (for example, MultiMatrixedPropertySet) without rewriting the Property class itself. This is true of the HashTable inside the PropertySet class as well. Smaller classes with distinct and well-defined purposes and uses are much more reusable than large classes with many methods to handle every possible use.

The Property class can also be used to publicly expose methods that can be called from outside code via function pointers. With a small amount of additional coding, this can also be used as a save state for a save game feature. It could also be used for object messaging via networks. With the addition of a Send(std::string xml) and Receive(std::string xml), the PropertySet could easily encode and decode XML messages that contain the property values, or property values that need to be changed. The Property/PropertySet classes could also be rewritten as templates to support different property types.

Isolating the property data using "get" and "set" methods will allow for format conversion to and from the internal stored format. This will free the using code from needing to know anything about the data type of the property, making it more versatile at the cost of a small speed hit when the types differ.
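The format conversion mentioned above, which is also what a function like SetUnknownValue has to do, can be sketched with a stream-based parser. This is a hypothetical helper written for illustration, not part of the Property class on the CD; SetFromText is an invented name.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: converts text into the type of the target variable.
// Returns false, leaving the target untouched, if the text does not parse
// cleanly as a value of that type.
template <typename T>
bool SetFromText(T* target, const std::string& text)
{
    assert(target != nullptr);

    std::istringstream in(text);
    T parsed;
    if (!(in >> parsed))
        return false;  // not a value of type T at all

    char extra;
    if (in >> extra)
        return false;  // trailing characters after the value, e.g. "12x"

    *target = parsed;
    return true;
}
```

An editor or network layer could call such a helper with whatever text the user or message supplies, paying the parsing cost only when the incoming format differs from the stored one.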


Additional Reading
Fowler, Martin, Kent Beck, John Brant, William Opdyke, Don Roberts, Refactoring, Addison-Wesley, ISBN: 0201485672.
Gamma, Erich, Richard Helm, Ralph Johnson, John Vlissides, Grady Booch, Design Patterns, Addison-Wesley, ISBN: 0201633612.
Lakos, John, Large-Scale C++ Software Design, Addison-Wesley, ISBN: 0201633620.
McConnell, Steve C., Code Complete: A Practical Handbook of Software Construction, Microsoft Press, ISBN: 1556154844 (anything by McConnell is good).
Meyers, Scott, Effective C++: 50 Specific Ways to Improve Your Programs and Design (2nd Edition), Addison-Wesley, ISBN: 0201924889.
Meyers, Scott, More Effective C++: 35 New Ways to Improve Your Programs and Designs, Addison-Wesley, ISBN: 020163371X.

1.8
A Game Entity Factory
François Dominic Laramée
[email protected]

In recent years, scripting languages have proven invaluable to the game development community. By isolating the elaboration and refinement of game entity behavior from the core of the code base, they have liberated level designers from the code-compile-execute cycle, speeding up game testing and tweaking by orders of magnitude, and freed senior programmers' time for more intricate assignments. However, for the data-driven development paradigm to work well, the game's engine must provide flexible entity construction and assembly services, so that the scripting language can provide individual entities with different operational strategies, reaction behaviors, and other parameters. This is the purpose of this gem: to describe a hierarchy of C++ classes and a set of techniques that support data-driven development on the engine side of things.

This simple framework was designed with the following goals in mind:
A separation of logical behavior and audio-visual behavior. A single Door class can support however many variations of the concept as required, without concern for size, number of key frames in animation sequences, etc.
Rapid development. Once a basic library of behaviors has been defined (which takes surprisingly little time), new game entity classes can be added to the framework with a minimum of new code, often in 15 minutes or less.
Avoiding code duplication. By assembling bits and pieces of behavior into new entities at runtime, the framework avoids the "code bloat" associated with scripting languages that compile to C/C++, for example.

Several of the techniques in this gem are described in terms of patterns, detailed in the so-called "Gang of Four's" book Design Patterns [GoF94].

Components
The gem is built around three major components: flyweight objects, behavioral classes, and an object factory method. We will examine each in turn, and then look at how they work together to equip the engine with the services required by data-driven development. Finally, we will discuss advanced ideas to make the system even more flexible (at the cost of some code complexity) if a full-fledged scripting language is required by the project.

Flyweight, Behavior, and Exported Classes


Before we go any further, we must make a distinction between the three types of "classes" to which a game entity will belong in this framework: its flyweight, behavioral, and exported classes.

- The flyweight class is the look and feel of the entity. In the code, the relationship between an entity and its flyweight class is implemented through object composition: the entity owns a pointer to a flyweight that it uses to represent itself audiovisually.
- The behavioral class defines how the object interacts with the rest of the game world. Behavioral classes are implemented as a traditional inheritance hierarchy, with class Entity serving as abstract superclass for all others.
- The exported class is how the object represents itself to the world. More of a convenience than a requirement, the exported class is implemented as an enum constant and allows an entity to advertise itself as several different object classes during its lifetime.

Let us now look at each in turn.

Flyweight Objects
[GoF94] describes flyweights as objects deprived of their context so that they can be shared and used in a variety of situations simultaneously; in other words, as a template or model for other objects. For a game entity, the flyweight-friendly information consists of:

- Media content: Sound effects, 3D models, textures, animation files, etc.
- Control structure: Finite state machine definition, scripts, and the like.

As you can see, this is just about everything except information on the current status of the entity (position, health, FSM state). Therefore, in a gaming context, the term flyweight is rather unfortunate, because the flyweight can consume megabytes of memory, while the context information would be small enough to fit within a crippled toaster's core memory.
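The split between shared flyweight data and per-entity context can be pictured with a small sketch. Everything below is illustrative: the type names GhostFlyweight, EntityContext, GhostInstance and the helper SpawnGhosts are hypothetical and do not come from the companion code, and std::shared_ptr is used only to make the sharing visible.

```cpp
#include <memory>
#include <string>
#include <vector>

// Shared, read-only data: loaded once per entity family (hypothetical fields).
struct GhostFlyweight {
    std::string modelFile;
    std::vector<std::string> soundFiles;
    int numAnimSequences;
};

// Per-instance context: small, and different for every entity.
struct EntityContext {
    float x, y, z;
    int health;
    int fsmState;
};

// An entity pairs its own context with a pointer to the shared flyweight.
struct GhostInstance {
    std::shared_ptr<const GhostFlyweight> flyweight;
    EntityContext context;
};

// Build `count` instances that all reference the same flyweight.
std::vector<GhostInstance> SpawnGhosts(std::shared_ptr<const GhostFlyweight> shared,
                                       int count) {
    std::vector<GhostInstance> ghosts;
    for (int i = 0; i < count; ++i) {
        ghosts.push_back(
            GhostInstance{shared, EntityContext{float(i), 0.0f, 0.0f, 100, 0}});
    }
    return ghosts;
}
```

Spawning four ghosts this way costs four small EntityContext records plus four pointers; the heavy media data exists only once.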

SAMMy, Where Are You?


Much of a game entity's finite state machine deals with animation loops, deciding when to play a sound byte, and so forth. For example, after the player character is killed in an arcade game, it may enter the resurrecting state and be flagged as invulnerable while the "resurrection" animation plays out; otherwise, an overeager monster might hover about and score another kill during every frame of animation until the player resumes control over it. I call the part of the flyweight object that deals with this the State And Media Manager, or SAMMy for short:
class StateAndMediaManager
{
    // The various animation sequences available for the family
    // of entities
    AnimSequenceDescriptionStruct * sequences;
    int numAnimSequences;

    // A table of animation sequences to fire up when the entity's FSM
    // changes states out of SAMMy's control
    int * stateToAnimTransitions;
    int numStateToAnimTransitions;

public:
    // Construction and destruction
    // StateAndMediaManager is always constructed by its owner entity,
    // which is in charge of opening its description file. Therefore,
    // the only parameter the constructor needs is a reference to an
    // input stream from which to read a set of animation sequence
    // descriptions.
    StateAndMediaManager() : sequences( 0 ), numAnimSequences( 0 ),
        stateToAnimTransitions( 0 ), numStateToAnimTransitions( 0 ) {}
    StateAndMediaManager( istream & is );
    virtual ~StateAndMediaManager();
    void Cleanup();

    // Input-output functions
    void Load( istream & is );
    void Save( ostream & os );

    // FC must be a macro from the companion sources; it is not defined
    // in this excerpt.

    // Look at an entity's current situation and update it according
    // to the description of its animation sequences
    void FC UpdateEntityState( EntityStateStruct * state );

    // If the entity's FSM has just forced a change of state, the media
    // manager must follow suit, interrupt its current animation
    // sequence and choose a new one suitable to the new FSM state
    void FC AlignWithNewFSMState( EntityStateStruct * state );
};

Typically, SAMMy is the product of an entity-crafting tool, and it is loaded into the engine from a file when needed. A sample is included on the companion CD-ROM. SAMMy can be made as powerful and versatile as desired. In theory, SAMMy could take care of all control functions: launching scripts, changing strategies, and so forth. However, this would be very awkward and require enormous effort; we will instead choose to delegate most of the high-level control structure to the behavioral class hierarchy, which can take care of it with a minute amount of code. (As a side effect of this sharing of duties, a single behavioral class like SecurityGuard, Door or ExplosionFX will be able to handle entities based on multiple related flyweights, making the system more flexible.)

Behavioral Class Hierarchy


These are the actual C++ classes to which our entities will belong. The hierarchy has (at least) two levels:

- An abstract base class Entity that defines the interface and commonalities
- Concrete subclasses that derive from Entity and implement actual objects

Here is a look at Entity's interface:

class Entity
{
    // Some application-specific data

    // Flyweight and Exported Class information
    int exportedClassID;
    StateAndMediaManager * sammy;

public:
    // Constructors

    // Accessors
    int GetExportedClass() { return exportedClassID; }
    StateAndMediaManager * GetFlyweight() { return sammy; }
    void SetExportedClass( int newval ) { exportedClassID = newval; }
    void SetFlyweight( StateAndMediaManager * ns ) { sammy = ns; }

    // Factory method
    static Entity * EntityFactory( int exportedClassRequested );

    virtual Entity * CloneEntity() = 0;
    virtual bool UpdateSelf()
    {
        // Do generic stuff here; looping through animations, etc.
        return true;
    }
    virtual bool HandleInteractions( Entity * target ) = 0;
};

As you can see, adding a new class to the hierarchy may require very little work: in addition to constructors, at most three, and possibly only two, of the base class methods must be overridden, and one of them is a one-liner.

- CloneEntity() is a simple redirection call to the copy constructor.
- UpdateSelf() runs the entity's internal mechanics. For some, this may be as simple as calling the corresponding method in SAMMy to update the current animation frame; for others, like the player character, it can be far more elaborate.
- HandleInteractions() is called when the entity is supposed to determine whether it should change its internal state in accordance with the behaviors and positions of other objects. An empty implementation makes the object inert window-dressing.

The companion CD-ROM contains examples of Entity subclasses, including one of a player character driver.

Using the Template Method Pattern for Behavior Assignment


If your game features several related behavioral classes whose differences are easy to circumscribe, you may be able to benefit from a technique known as the Template Method pattern [GoF94]. This consists of a base class method that defines an algorithm in terms of subclass methods it calls through polymorphism. For example, all types of PlayerEntity objects will need to query an input device and move themselves as part of their UpdateSelf () method, but how they do it may depend on the input device being used, the character type (a FleetingRogue walks faster than a OneLeggedBuddha), and so forth. Therefore, the PlayerEntity class may define UpdateSelf () in terms of pure virtual methods implemented only in its subclasses.
class PlayerDevice : public Entity
{
    // ...
    void UpdateYourself();
    // No implementation in PlayerDevice
    virtual void QueryInputDevice() = 0;
};

class JoystickPlayerDevice : public PlayerDevice
{
    // ...
    void QueryInputDevice();
};

void PlayerDevice::UpdateYourself()
{
    // do stuff common to all types of player devices
    QueryInputDevice();
    // do more stuff
}

void JoystickPlayerDevice::QueryInputDevice()
{
    // Do the actual work
}

Used properly, the Template Method pattern can help minimize the need for the dreaded cut-and-paste programming, one of the most powerful "anti-patterns" leading to disaster in software engineering [Brown98]!


Exported Classes
The exported class is a convenience trick that you can use to make your entities' internal state information transparent to the script writer. For example, let's say that you are programming Pac-Man's HandleInteractions() method. You might start by looking for a collision with one of the ghosts; what happens if one is found then depends on whether the ghost is afraid (it gets eaten), returning to base after being eaten (nothing happens at all), or hunting (Pac-Man dies).

void PacMan::HandleInteractions( Entity * target )
{
    if ( target->GetClass() == GHOST && target->GetState() == AFRAID )
    {
        score += 100;
        target->SendKillSignal();
    }
    // ...
}

However, what if you need to add states to the Ghost object? For example, you may want the ghost's SAMMy to include a "Getting Scared" animation loop, which is active for one second once Pac-Man has run over a power pill. SAMMy would handle this cleanly if you added a GettingScared state. However, you would now need to add a test for the GettingScared state to the event handler.

void PacMan::HandleInteractions( Entity * target )
{
    if ( target->GetClass() == GHOST &&
         ( target->GetState() == AFRAID ||
           target->GetState() == GETTINGSCARED ) )
    {
        // ...
    }
}

This is awkward, and would likely result in any number of updates to the event handlers as you add states (none of which introduce anything new from the outside world's perspective) to SAMMy during production. Instead, let's introduce the concept of the exported class, a value that can be queried from an Entity object and describes how it advertises itself to the world. The value is maintained within UpdateSelf() and can take any number of forms; for simplicity's sake, let's pick an integer constant selected from an enum list.
enum { SCAREDGHOST, ACTIVEGHOST, DEADGHOST };

There is no need to export any information on transient, animation-related states like GettingScared. To Pac-Man, a ghost can be dead, active, or scared, period. Whether it has just become scared two frames ago, has been completely terrified for a while, or is slowly gathering its wits back around itself is irrelevant. By using an exported class instead of an actual internal FSM state, a Ghost object can advertise itself as a dead ghost, scared ghost, or active ghost, effectively shape-shifting into three different entity classes at will from the outside world's perspective, all for the cost of an integer. Pac-Man's interaction handler would now look like this:

void PacMan::HandleInteractions( Entity * target )
{
    if ( target->GetExportedClass() == SCAREDGHOST )
    {
        score += 100;
        target->SendKillSignal();
    }
    // ...
}

The result is cleaner and will require far less maintenance work, as the number of possible exported classes for an Entity is usually small and easy to determine early on, while SAMMy's FSM can grow organically as new looks and effects are added to the object during development.
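One way to keep the exported class in step with the internal FSM is a single mapping function called from UpdateSelf(). The following is a minimal sketch under assumptions: the internal state names (HUNTING, RECOVERING, and so on), the helper ExportedClassFor(), and SetState() are hypothetical and are not part of the companion code.

```cpp
// Internal FSM states can multiply freely during production (hypothetical names).
enum GhostState { HUNTING, GETTING_SCARED, AFRAID, RECOVERING, RETURNING_TO_BASE };

// Exported classes: the only values the outside world tests against.
enum ExportedClass { SCAREDGHOST, ACTIVEGHOST, DEADGHOST };

// Map an internal state to the class the ghost advertises.
ExportedClass ExportedClassFor(GhostState state) {
    switch (state) {
        case GETTING_SCARED:
        case AFRAID:
        case RECOVERING:        return SCAREDGHOST;
        case RETURNING_TO_BASE: return DEADGHOST;
        case HUNTING:           return ACTIVEGHOST;
    }
    return ACTIVEGHOST;
}

class Ghost {
    GhostState state = HUNTING;
    ExportedClass exportedClassID = ACTIVEGHOST;
public:
    void SetState(GhostState s) { state = s; }
    // Called once per frame; keeps the advertised class in sync with the FSM.
    bool UpdateSelf() {
        exportedClassID = ExportedClassFor(state);
        return true;
    }
    int GetExportedClass() const { return exportedClassID; }
};
```

Adding a new animation-only state then means touching one switch statement rather than every interaction handler that looks at ghosts.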

The Entity Factory


Now that we have all of these tools, it is time to put them to good use in object creation. A level file will contain several entity declarations, each of which may identify the entity's behavioral class, flyweight class, exported class, starting position and velocity, and any number of class-specific parameters (for example, hit points for monsters, a starting timer for a bomb, a capacity for the player's inventory, etc.). To keep things simpler for us and for the level designer, let's make the fairly safe assumption that, while a behavioral class may advertise itself as any number of exported classes, an exported class can only be attached to a single behavioral class. This way, we eliminate the need to specify the behavioral class in the level file, and isolate the class hierarchy from the tools and level designers. A snippet from a level file could therefore look like:
<ENTITY Blinky>
    <EXPORTEDCLASS ActiveGhost>
    <XYZ_POSITION ...>
    <PARAMETERS ...>
</ENTITY>

Now, let's add a factory method to the Entity class. A factory is a function whose job consists of constructing instances of any number of classes of objects on demand; in our case, the factory will handle requests for all (concrete) members of the behavioral class hierarchy. Programmatically, our factory method is very simple:

- It owns a registry that describes the flyweights that have already been loaded into the game and a list of the exported classes that belong to each behavioral class.
- It loads flyweights when needed. If a request for an instance belonging to a flyweight class that hasn't been seen yet is received, the first order of business is to create and load a SAMMy object for this flyweight.
- If the request is for an additional instance of an already-loaded flyweight class, the factory will clone the existing object (which now serves as a Prototype; yes, another Gang of Four pattern!) so that it and its new brother can share flyweights effectively.

Here is a snippet from the method:
Entity * Entity::EntityFactory( int whichType )
{
    Entity * ptr = 0;
    switch( whichType )
    {
        case SCAREDGHOST:
            ptr = new Ghost( SCAREDGHOST );
            break;
        case ACTIVEGHOST:
            ptr = new Ghost( ACTIVEGHOST );
            break;
        // ...
    }
    return ptr;
}

Simple, right? Calling the method with an exported class as parameter returns a pointer to an Entity subclass of the appropriate behavioral family.

Entity * newEntity = Entity::EntityFactory( ACTIVEGHOST );

The code located on the companion CD-ROM also implements a simple trick used to load levels from standard text files: an entity's constructor receives the level file as an istream parameter, and it can read its own class-specific parameters directly from it. The factory method therefore does not need to know anything about the internals of the subclasses it is responsible for creating.
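The registry and prototype-cloning behavior described for the factory can be sketched as follows. This is only an illustration under stated assumptions: the Flyweight struct stands in for SAMMy, the EntityFactory class and its Create() and FlyweightsLoaded() members are hypothetical, smart pointers replace the raw pointers of the listing above, and the mapping from exported class to behavioral class is reduced to a single Ghost class.

```cpp
#include <map>
#include <memory>
#include <string>

enum ExportedClass { SCAREDGHOST, ACTIVEGHOST, DEADGHOST };

// Stand-in for SAMMy: the shared flyweight (a real one would load media).
struct Flyweight { std::string name; };

class Entity {
    int exportedClassID;
    std::shared_ptr<Flyweight> sammy;
public:
    Entity(int cls, std::shared_ptr<Flyweight> fw) : exportedClassID(cls), sammy(fw) {}
    virtual ~Entity() {}
    virtual Entity* CloneEntity() const = 0;
    int GetExportedClass() const { return exportedClassID; }
    void SetExportedClass(int cls) { exportedClassID = cls; }
    const Flyweight* GetFlyweight() const { return sammy.get(); }
};

class Ghost : public Entity {
public:
    using Entity::Entity;
    // Copying the entity copies the pointer, so the flyweight stays shared.
    Entity* CloneEntity() const override { return new Ghost(*this); }
};

class EntityFactory {
    // One prototype per flyweight name; later requests clone it.
    std::map<std::string, std::unique_ptr<Entity>> prototypes;
    int flyweightsLoaded = 0;
public:
    std::unique_ptr<Entity> Create(const std::string& flyweightName, int exportedClass) {
        auto it = prototypes.find(flyweightName);
        if (it == prototypes.end()) {
            // First request: "load" the flyweight and build the prototype.
            auto fw = std::make_shared<Flyweight>(Flyweight{flyweightName});
            ++flyweightsLoaded;
            it = prototypes.emplace(flyweightName,
                     std::unique_ptr<Entity>(new Ghost(exportedClass, fw))).first;
        }
        // Every instance handed out is a clone of the stored prototype.
        std::unique_ptr<Entity> copy(it->second->CloneEntity());
        copy->SetExportedClass(exportedClass);
        return copy;
    }
    int FlyweightsLoaded() const { return flyweightsLoaded; }
};
```

Requesting two ghosts from the same flyweight name loads the flyweight once; both instances then report the same GetFlyweight() pointer while keeping their own exported class.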


Selecting Strategies at Runtime


The techniques described so far work fine when a game contains a small number of behavioral classes, or when entity actions are easy enough to define without scripts. However, what if extensive tweaking and experimentation with scripts is required? What if you need a way to change an entity's strategy at runtime, without necessarily influencing its behavioral classmates? This is where the Strategy pattern comes into play. (It's the last one, I promise. I think.) Let's assume that your script compiler produces C code. What you need is a way to connect the C function created by the compiler with your behavioral class (or individual entity). This is where function pointers come into play.


Using Function Pointers within C++ Classes

The simplest and best way to plug a method into a class at runtime is through a function pointer. A quick refresher: A C/C++ function pointer is a variable containing a memory address, just like any other pointer, except that the object being pointed to is a typed function defined by a nameless signature (in other words, a return type and a parameter list). Here is an example of a declaration of a pointer to a function taking two Entity objects and returning a Boolean value:
bool (*interactPtr) (Entity * source, Entity * target);

Assuming that there is a function with the appropriate signature in the code, for example:
bool TypicalRabbitInteractions( Entity * source, Entity * target );

then the variable interactPtr can be assigned to it, and the function called by dereferencing the pointer, so that the following snippets are equivalent:
Ok = TypicalRabbitInteractions( BasilTheBunny, BigBadWolf );

and

interactPtr = TypicalRabbitInteractions;
Ok = (*interactPtr) ( BasilTheBunny, BigBadWolf );
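For readers who want to compile the idea, here is a small self-contained sketch. The Entity struct, the rule inside TypicalRabbitInteractions(), the InteractionFn typedef, and RunInteraction() are placeholders invented for the illustration; only the pointer syntax itself is the point.

```cpp
#include <string>

struct Entity {
    std::string name;
    int health;
};

// A function whose signature matches the pointer type below.
bool TypicalRabbitInteractions(Entity* source, Entity* target) {
    // Hypothetical rule: the rabbit is fine only if the target is harmless.
    (void)source;
    return target->health <= 0;
}

// Pointer to any function taking two Entity* and returning bool.
typedef bool (*InteractionFn)(Entity* source, Entity* target);

// Call whatever function the pointer currently designates.
bool RunInteraction(InteractionFn fn, Entity* source, Entity* target) {
    return (*fn)(source, target);   // fn(source, target) is equivalent
}
```

A typedef such as InteractionFn keeps declarations readable once the same signature appears in several places.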

Using function pointers inside classes is a little trickier, but not by much. The key idea is to declare the function generated by the script compiler to be a friend of the class, so that it can access its private data members, and to pass it the special pointer this, which represents the current object, as its first parameter.
class SomeEntity : public Entity
{
    // The function pointer
    void ( * friendptr )( Entity * me, Entity * target );

public:
    // Declare one or more strategy functions as friends.
    friend void Strategy1( Entity * me, Entity * target );

    // The actual operation
    void HandleInteractions( Entity * target )
    {
        (*friendptr) ( this, target );
    }
};


Basically, this is equivalent to doing by hand what the C++ compiler does for you when calling class methods: the compiler secretly adds "this" as a first parameter to every method. A call through a function pointer costs about as much as a virtual method call (although, unlike a direct call, it usually cannot be inlined), so there should be little performance difference between calling a compiled script with this scheme and calling a regular virtual method.

Note that picking and choosing strategies at runtime through function pointers is also a good way to reduce the number of behavioral classes in the hierarchy. In extreme cases, a single Entity class containing nothing but function pointer dereferences for strategy elements may even be able to replace the entire hierarchy. This, however, runs the risk of obfuscating the code to a point of total opacity; proceed with caution.

Finally, if an entity is allowed to switch back and forth between several alternative strategies depending on runtime considerations, this scheme allows each change to be implemented through a simple pointer assignment: clean, fast, no hassles.
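Switching strategies by pointer assignment might look like the sketch below. The strategy names FleeStrategy and AttackStrategy, the health threshold, and the lastAction bookkeeping are all hypothetical; a real project would plug in functions emitted by its script compiler.

```cpp
class SomeEntity;

// Strategy functions, e.g. as a script compiler might emit them (hypothetical).
void FleeStrategy(SomeEntity* me, SomeEntity* target);
void AttackStrategy(SomeEntity* me, SomeEntity* target);

class SomeEntity {
    int health;
    int lastAction;   // 0 = none, 1 = fled, 2 = attacked
    void (*strategy)(SomeEntity* me, SomeEntity* target);

    // Friends may touch private members just like a method would.
    friend void FleeStrategy(SomeEntity* me, SomeEntity* target);
    friend void AttackStrategy(SomeEntity* me, SomeEntity* target);

public:
    explicit SomeEntity(int hp)
        : health(hp), lastAction(0), strategy(&AttackStrategy) {}

    void HandleInteractions(SomeEntity* target) {
        // Switching strategy is a single pointer assignment.
        strategy = (health < 20) ? &FleeStrategy : &AttackStrategy;
        (*strategy)(this, target);
    }

    int Health() const { return health; }
    int LastAction() const { return lastAction; }
};

void FleeStrategy(SomeEntity* me, SomeEntity*) { me->lastAction = 1; }

void AttackStrategy(SomeEntity* me, SomeEntity* target) {
    me->lastAction = 2;
    target->health -= 10;
}
```

The class itself never needs to know how many strategies exist; it only needs their common signature.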

Final Notes
In simple cases, the techniques described in this gem can even provide a satisfactory alternative to scripting altogether. Smaller projects that do not require the full power of a scripting language and/or cannot afford the costs associated with it may be able to get by with a set of hard-coded strategy snippets, a simple GUI-based SAMMy editor, and a linear level-description file format containing key-value tuples for the behaviors attached to each entity.
<EntityName BasilTheBunny>
<ExportedClass Rabbit>
<StrategyVsEntity BigBadWolf Avoid>
<HandleCollision BigBadWolf Die>

The companion CD-ROM contains several component classes and examples of the techniques described in this gem. You will, however, have to make significant modifications to them (for example, add your own 3D models to SAMMy) to turn them into something useful in your own projects.

Finally, the text file formats used to load SAMMy and other objects in the code are assumed to be the output of a script compiler, level editor, or other associated tools. As such, they have a rather inflexible structure and are not particularly human friendly. If they seem like gibberish to you, gentle readers, please take a moment to commiserate with the poor author who had to write and edit them by hand. ;-)

References
[Brown98] Brown, W.H., Malveau, R.C., McCormick III, H.W., Mowbray, T.J., AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis, Wiley Computer Publishing, 1998.
[GoF94] Gamma, E., Helm, R., Johnson, R., Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1994.
[Rising98] Rising, L. ed., The Patterns Handbook: Techniques, Strategies and Applications, Cambridge University Press, 1998.

1.9
Adding Deprecation Facilities to C++
Noel Llopis, Meyer/Glass Interactive

During the lifetime of a piece of software, function interfaces are bound to change, become outdated, or be completely replaced by new ones. This is especially true for libraries and engines that are reused across multiple projects or over several years. When a function that interfaces to the rest of the project changes, the game (or tools, or both!) may not compile any more. On a team working on a large project, the situation is even worse because many people could be breaking interfaces much more often.

Possible Solutions
There are different ways of dealing with this situation:

- Don't do anything about it. Every time something changes, everybody has to update the code that calls the changed functions before work can proceed. This might be fine for a one-person team, but it's normally unacceptable for larger teams.
- Don't change any interface functions. This is not usually possible, especially in the game industry where things change so quickly. Maybe the hardware changed, maybe the publisher wants something new, or perhaps the initial interface was just flawed. Trying to stick to this approach usually causes more harm than good, and ends up resulting in functions or classes with names completely unrelated to what they really do, and completely overloaded semantics.
- Create new interface versions. This approach sticks to the idea that an interface will never change; instead, a new interface will be created addressing all the issues. Both the original and the new interface will remain in the project. This is what DirectX does with each new version. This approach might be fine for complete changes in interface, or for infrequent updates, but it won't work well for frequent or minor updates. In addition, this approach usually requires maintaining the full implementation of the current interface and a number of the older interfaces, which can be a nightmare.

In modern game development, these solutions are clearly not ideal. We need something else to deal with this problem.

The Ideal Solution


What we really want is to be able to write a new interface function, but keep the old interface function around for a while. Then the rest of the team can start using the new function right away. They may change their old code to use the new function whenever they can, and, after a while, when nobody is using it anymore, the old function can be removed. The problem with this is how to let everybody know which functions have changed and which functions they are supposed to use. Even if we always tell them this, how are they going to remember it if everything compiles and runs correctly? This is where deprecating a function comes in. We write the new function, and then flag the old function as deprecated. Then, every time the old function is used, the compiler will generate a message explaining that a deprecated function is being called and mentioning which function should be used in its place.

Using and Assigning Deprecated Functions


Java has a built-in way to do exactly what we want. However, most commercial games these days seem to be written mostly using C++, and unfortunately C++ doesn't contain any deprecation facilities. The rest of this gem describes a solution implemented in C++ to flag specific functions as deprecated.

Let's start with an example of how to use it. Say we have a function that everybody is using called FunctionA(). Unfortunately, months later, we realize that the interface of FunctionA() has to change, so we write a new function called NewFunctionA(). By adding just one line, we can flag FunctionA() as deprecated.

int FunctionA ( void )
{
    DEPRECATE ( "FunctionA()", "NewFunctionA()" )
    // Implementation
}

int NewFunctionA ( void )
{
    // Implementation
}
The line DEPRECATE("FunctionA()", "NewFunctionA()") indicates that FunctionA() is deprecated, and that it has been replaced with NewFunctionA(). The users of FunctionA() don't have to do anything special at all. Whenever users use FunctionA() they will get the following message in the debug window when they exit the program:

WARNING. You are using the following deprecated functions:
- Function FunctionA() called from 3 different places.
  Instead use NewFunctionA().


Implementing Deprecation in C++


Everything is implemented in one simple singleton class [Gamma95]: DeprecationMgr. The full source code for the class along with an example program is included on the companion CD-ROM. In its simplest form, all DeprecationMgr does is keep a list of the deprecated functions found so far. Whenever the singleton is destroyed (which happens automatically when the program exits), the destructor prints out a report in the debug window, listing what deprecated functions were used in that session.

class DeprecationMgr
{
public:
    static DeprecationMgr * GetInstance ( void );
    ~DeprecationMgr ( void );
    bool AddDeprecatedFunction ( const char * OldFunctionName,
                                 const char * NewFunctionName,
                                 unsigned int CalledFrom );
    // Rest of the declaration here
};

Usually, we won't have to deal with this class directly because the DEPRECATE macro will do all of the work for us.

#ifdef _DEBUG
#define DEPRECATE(a, b) { \
    void * fptr; \
    _asm { mov fptr, ebp } \
    DeprecationMgr::GetInstance()->AddDeprecatedFunction(a, b, fptr); \
}
#else
#define DEPRECATE(a, b)
#endif

Ignoring the first few lines, all the DEPRECATE macro does is get an instance of the DeprecationMgr and add the function that is being executed to the list. Because DeprecationMgr is a singleton that won't be instantiated until the GetInstance() function is called, if there are no deprecated functions, it will never be created and it will never print any reports at the end of the program execution.

Internally, DeprecationMgr keeps a small structure for each deprecated function, indexed by the function name through an STL map collection. Only the first call to a deprecated function will insert a new entry in the map.

The DeprecationMgr class has one more little perk: it will keep track of the number of different places from which each deprecated function was called. This is useful so we know at a glance how many places in the code we need to change when we decide to stop using the deprecated function.

Unfortunately, because this trick uses assembly directly, it is platform specific and only works on the x86 family of CPUs. The first two lines of the DEPRECATE macro get the EBP register (from which it is usually possible to retrieve the return address), and pass it on to AddDeprecatedFunction(). Then, if a function is called multiple times from the same place (in a loop for example), it will only be reported as being called from one place.

There is a potential problem with this approach for obtaining the return address. With a standard stack frame, the address [EBP+4] contains the return address for the current function. However, under some circumstances the compiler might not set the register EBP to its expected value. In particular, this happens under VC++ 6.0, when compiler optimizations are turned on, for particularly simple functions. In this case, trying to read from [EBP+4] will either return an incorrect value or crash the program. There will be no problems in release mode, which is when optimizations are normally turned on, because the macro does no work. However, sometimes optimizations are also used in debug mode, so inside the function AddDeprecatedFunction() we only try to read the return address if the address contained in [EBP+4] is readable by the current process. This is accomplished by either using exception handling or calling the Windows-specific function IsBadReadPtr(). When optimizations are turned on, this will produce an incorrect count of the places from which deprecated functions were called, but at least it won't cause the program to crash, and all the other functionality of the deprecation manager will still work correctly.

What Could Be Improved?


One major problem remains: the deprecation warnings are generated at runtime, not at compile or link time. This is necessary because the deprecated functions may exist in a separate library, rather than in the code that is being compiled. The main drawback of only reporting the deprecated functions at runtime is that it is possible for the program to still be using a deprecated function that gets called rarely enough that it never gets noticed. The use of the deprecated function might not be detected until it is finally removed and the compiler reports an error.
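Compilers newer than the ones this gem targets offer a compile-time alternative: C++14 standardized the [[deprecated]] attribute. The sketch below is an aside rather than part of the gem's technique; the function bodies and the return value 42 are placeholders.

```cpp
// Replacement function.
int NewFunctionA() { return 42; }

// C++14 attribute: code that calls FunctionA() draws a compiler warning
// carrying this message. When the function lives in a library, the
// attribute belongs on the declaration in the header that callers include.
[[deprecated("use NewFunctionA() instead")]]
int FunctionA() { return NewFunctionA(); }
```

The two approaches complement each other: the attribute catches every call that gets compiled, while the runtime report shows which deprecated paths actually executed in a session.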

Acknowledgments
I would like to thank David McKibbin for reviewing this gem, and for identifying the problems caused by compiler optimizations and finding a workaround.

References
[Gamma95] Gamma, Erich, et al., Design Patterns, Addison-Wesley, 1995.
[Rose] Rose, John, "How and When to Deprecate APIs," available online at java.sun.com/products/jdk/1.1/docs/guide/misc/deprecation/deprecation.html.

1.10
A Drop-in Debug Memory Manager
Peter Dalton, Evans & Sutherland

With the increasing complexity of game programming, the minimum memory requirements for games have skyrocketed. Today's games must effectively deal with the vast amounts of resources required to support graphics, music, video, animations, models, networking, and artificial intelligence. As the project grows, so does the likelihood of memory leaks, memory bounds violations, and allocating more memory than is required. This is where a memory manager comes into play. By creating a few simple memory management routines, we will be able to track all dynamically allocated memory and guide the program toward optimal memory usage.

Our goal is to ensure a reasonable memory footprint by reporting memory leaks, tracking the percentage of allocated memory that is actually used, and alerting the programmer to bounds violations. We will also ensure that the interface to the memory manager is seamless, meaning that it does not require any explicit function calls or class declarations. We should be able to take this code and effortlessly plug it into any other module by including the header file and have everything else fall into place.

The disadvantages of creating a memory manager include the overhead time required for the manager to allocate memory, deallocate memory, and interrogate the memory for statistical information. Thus, this is not an option that we would like to have enabled for the final build of our game. In order to avoid these pitfalls, we are going to only enable the memory manager during debug builds, or if the symbol ACTIVATE_MEMORY_MANAGER is defined.

Getting Started
The heart of the memory manager centers on overloading the standard new and delete operators, as well as using #define to create a few macros that allow us to plug in our own routines. By overloading the memory allocation and deallocation routines, we will be able to replace the standard routines with our own memory-tracking module. These routines will log the file and line number on which the allocation is being requested, as well as statistical information.


The first step is to create the overloaded new and delete operators. As mentioned earlier, we would like to log the file and line number requesting the memory allocation. This information will become priceless when trying to resolve memory leaks, because we will be able to track the allocation to its roots. Here is what the actual overloaded operators will look like:

inline void* operator new(size_t size, const char *file, int line);
inline void* operator new[](size_t size, const char *file, int line);
inline void operator delete( void *address );
inline void operator delete[]( void *address );

It's important to note that both the standard and array versions of the new and delete operators need to be overloaded to ensure proper functionality. While these declarations don't look too complex, the problem that now lies before us is getting all of the routines that will use the memory manager to seamlessly pass the new operator the additional parameters. This is where the #define directive comes into play.
#define new new( __FILE__, __LINE__ )
#define delete setOwner(__FILE__,__LINE__),false ? setOwner("",0) : delete

#define malloc(sz) AllocateMemory(__FILE__,__LINE__,sz,MM_MALLOC)
#define calloc(num,sz) AllocateMemory(__FILE__,__LINE__,sz*num,MM_CALLOC)
#define realloc(ptr,sz) AllocateMemory( __FILE__, __LINE__, sz, MM_REALLOC, ptr )
#define free(sz) deAllocateMemory( __FILE__, __LINE__, sz, MM_FREE )

The #define new statement will replace all new calls with our variation of new that takes as parameters not only the requested size of the allocation, but also the file and line number for tracking purposes. Microsoft's Visual C++ compiler provides a set of predefined macros, which include our required __FILE__ and __LINE__ symbols [MSDN]. The #define delete macro is a little different from the #define new macro. It is not possible to pass additional parameters to the overloaded delete operator without creating syntax problems. Instead, the setOwner() method records the file and line number for later use. Note that it is also important to create the macro as a conditional to avoid common problems associated with multiple-line macros [Dalton01]. Finally, to be complete, we have also replaced the malloc(), calloc(), realloc(), and free() methods with our own memory allocation and deallocation routines.

The implementations for these functions are located on the accompanying CD. The AllocateMemory() and deAllocateMemory() routines are solely responsible for all memory allocation and deallocation. They also log information pertaining to the desired allocation, and initialize or interrogate the memory, based on the desired action. All this information will then be available to generate the desired statistics to analyze the memory requirements for any given program.
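As a minimal sketch of the macro technique (not the CD implementation), the following shows how a redefined new can forward the file and line to an overloaded operator. The names g_lastFile, g_lastLine, and demoTrackedAllocation are hypothetical and exist only for illustration, and the overload simply forwards to the standard allocator where a real manager would call its own AllocateMemory() routine.

```cpp
#include <cstddef>
#include <new>

// Hypothetical record of the most recent tracked allocation (illustration only).
static const char* g_lastFile = "";
static int g_lastLine = 0;

// Overload that receives the caller's file and line in addition to the size.
// This sketch forwards to the standard allocator; a real manager would log here.
void* operator new(std::size_t size, const char* file, int line) {
    g_lastFile = file;
    g_lastLine = line;
    return ::operator new(size);
}

// Matching placement form, used only if a constructor throws during 'new'.
void operator delete(void* address, const char*, int) noexcept {
    ::operator delete(address);
}

// From here on, every 'new' expression silently passes __FILE__ and __LINE__.
#define new new(__FILE__, __LINE__)

int demoTrackedAllocation() {
    int* value = new int(42);  // expands to: new(__FILE__, __LINE__) int(42)
    int recordedLine = g_lastLine;
    delete value;              // memory came from ::operator new, so plain delete matches
    return recordedLine;
}

#undef new  // stop the macro from leaking into headers included later
```

The #undef at the end is a precaution for this standalone sketch: it keeps the macro from affecting any header that is included afterward.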

Memory Manager Logging


Now that we have provided the necessary framework for replacing the standard memory allocation routines with our own, we are ready to begin logging. As stated in the beginning of this gem, we will concentrate on memory leaks, bounds violations, and the actual memory requirements. In order to log all of the required information, we must first choose a data structure to hold the information relevant to memory allocations. For efficiency and speed, we will use a chained hash table. Each hash table entry will contain the following information:

struct MemoryNode {
  size_t actualSize;
  size_t reportedSize;
  void *actualAddress;
  void *reportedAddress;
  char sourceFile[30];
  unsigned short sourceLine;
  unsigned short paddingSize;
  char options;
  long predefinedBody;
  ALLOC_TYPE allocationType;
  MemoryNode *next, *prev;
};
This structure contains the size of memory allocated not only for the user, but also for the padding applied to the beginning and ending of the allocated block. We also record the type of allocation to protect against allocation/deallocation mismatches. For example, if the memory was allocated using the new[] operator and deallocated using the delete operator instead of the delete[] operator, a memory leak may occur due to object destructors not being called. Effort has also been taken to minimize the size of this structure while maintaining maximum flexibility. After all, we don't want to create a memory manager that uses more memory than the actual application being monitored.

At this point, we should have all of the information necessary to determine if there are any memory leaks in the program. By creating a MemoryNode within the AllocateMemory() routine and inserting it into the hash table, we will create a history of all the allocated memory. Then, by removing the MemoryNode within the deAllocateMemory() routine, we will ensure that the hash table only contains a current listing of allocated memory. If upon exiting the program there are any entries left within the hash table, a memory leak has occurred. At this point, the MemoryNode can be interrogated to report the details of the memory leak to the user. As mentioned previously, within the deAllocateMemory() routine we will also validate that the method used to allocate the memory matches the deallocation method; if not, we will note the potential memory leak.

Next, let's gather information pertaining to bounds violations. Bounds violations occur when applications exceed the memory allocated to them. The most common place where this happens is within loops that access array information. For example, if we allocated an array of size 10, and we accessed array location 11, we would be exceeding the array bounds and overwriting or accessing information that does not belong to us. In order to protect against this problem, we are going to provide padding to the front and back of the memory allocated. Thus, if a routine requests 5 bytes, the AllocateMemory() routine will actually allocate 5 + sizeof(long)*2*paddingSize bytes. Note that we are using longs for the padding because they are 32-bit integers under Visual C++; the size of a long is not guaranteed on other platforms. Next, we must initialize the padding to a predefined value, such as 0xDEADC0DE. Then, upon deallocation, if we examine the padding and find any value except for the predefined value, we know that a bounds violation has occurred. At this point, we would interrogate the corresponding MemoryNode and report the bounds violation to the user.

The only information remaining to be gathered is the actual memory requirement for the program. We would like to know how much memory was allocated, how much of the allocated memory was actually used, and perhaps peak memory allocation information. In order to collect this information we are going to need another container. Note that only the relevant members of the class are shown here.

class MemoryManager {
public:
  unsigned int m_totalMemoryAllocations;
  unsigned int m_totalMemoryAllocated;  // In bytes
  unsigned int m_totalMemoryUsed;       // In bytes
  unsigned int m_peakMemoryAllocation;
};
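The padding check described above can be sketched as follows. This is only an illustration under stated assumptions, not the CD implementation: allocateGuarded, guardsIntact, and freeGuarded are hypothetical names, the padding is fixed at four 32-bit words per side (which also keeps the user block 16-byte aligned on typical platforms), and std::uint32_t is used in place of long so that the guard word is 32 bits everywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Assumed values for this sketch: four 32-bit guard words on each side.
const std::uint32_t kGuardPattern = 0xDEADC0DE;
const std::size_t kPaddingWords = 4;
const std::size_t kPaddingBytes = kPaddingWords * sizeof(std::uint32_t);

// Writes the guard pattern over one padding region.
static void writeGuard(unsigned char* region) {
    for (std::size_t i = 0; i < kPaddingWords; ++i)
        std::memcpy(region + i * sizeof(std::uint32_t), &kGuardPattern, sizeof(kGuardPattern));
}

// Returns true if one padding region still holds only the guard pattern.
static bool regionIntact(const unsigned char* region) {
    for (std::size_t i = 0; i < kPaddingWords; ++i) {
        std::uint32_t word;
        std::memcpy(&word, region + i * sizeof(std::uint32_t), sizeof(word));
        if (word != kGuardPattern) return false;
    }
    return true;
}

// Allocates 'size' user bytes plus padding in front and behind, and returns
// the address handed to the user (the "reported" address).
void* allocateGuarded(std::size_t size) {
    unsigned char* actual = static_cast<unsigned char*>(std::malloc(size + 2 * kPaddingBytes));
    if (actual == nullptr) return nullptr;
    writeGuard(actual);
    writeGuard(actual + kPaddingBytes + size);
    return actual + kPaddingBytes;
}

// Checks both padding regions; false means a bounds violation occurred.
bool guardsIntact(const void* reported, std::size_t size) {
    const unsigned char* user = static_cast<const unsigned char*>(reported);
    return regionIntact(user - kPaddingBytes) && regionIntact(user + size);
}

// Releases the whole block, starting from the actual (unpadded) address.
void freeGuarded(void* reported) {
    if (reported != nullptr)
        std::free(static_cast<unsigned char*>(reported) - kPaddingBytes);
}
```

In a full manager, the check would run inside the deallocation routine, and the sizes would come from the MemoryNode instead of being passed in by the caller.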

Within the AllocateMemory() routine, we will be able to update all of the MemoryManager information except for the m_totalMemoryUsed variable. In order to determine how much of the allocated memory is actually used, we will need to perform a trick similar to the method used in determining bounds violations. By initializing the memory within the AllocateMemory() routine to a predefined value and interrogating the memory upon deallocation, we should be able to get an idea of how much memory was actually utilized. In order to achieve decent results, we are going to initialize the memory on 32-bit boundaries, once again, using longs. We will also use a predefined value such as 0xBAADC0DE for initialization. For all remaining bytes that do not fit within our 32-bit boundaries, we will initialize each byte to 0xDE, or static_cast<char>(0xBAADC0DE). This method is potentially error prone, because there is no predefined value to which we could initialize the memory and guarantee uniqueness: an application that happens to write the pattern itself will be undercounted. Even so, initializing the memory on 32-bit boundaries will generate far better results than initializing on byte boundaries.
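A minimal sketch of this usage estimate follows. The function names fillUnused and estimateUsedBytes are hypothetical, and std::uint32_t stands in for long so the word size is fixed; the logic is only an approximation of the approach described, not the CD code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

const std::uint32_t kUnusedPattern = 0xBAADC0DE;

// Fills a block with the pattern in 32-bit steps; any trailing bytes that do
// not fill a whole word receive the low byte of the pattern (0xDE).
void fillUnused(void* block, std::size_t size) {
    unsigned char* bytes = static_cast<unsigned char*>(block);
    const std::size_t words = size / sizeof(std::uint32_t);
    for (std::size_t i = 0; i < words; ++i)
        std::memcpy(bytes + i * sizeof(std::uint32_t), &kUnusedPattern, sizeof(kUnusedPattern));
    for (std::size_t i = words * sizeof(std::uint32_t); i < size; ++i)
        bytes[i] = static_cast<unsigned char>(kUnusedPattern);
}

// Estimates how many bytes were written: every 32-bit word (or trailing byte)
// that no longer matches the pattern is counted as used.
std::size_t estimateUsedBytes(const void* block, std::size_t size) {
    const unsigned char* bytes = static_cast<const unsigned char*>(block);
    const std::size_t words = size / sizeof(std::uint32_t);
    std::size_t used = 0;
    for (std::size_t i = 0; i < words; ++i) {
        std::uint32_t word;
        std::memcpy(&word, bytes + i * sizeof(std::uint32_t), sizeof(word));
        if (word != kUnusedPattern) used += sizeof(std::uint32_t);
    }
    for (std::size_t i = words * sizeof(std::uint32_t); i < size; ++i)
        if (bytes[i] != static_cast<unsigned char>(kUnusedPattern)) ++used;
    return used;
}
```

Because whole words are compared, writing a single byte marks all four bytes of its word as used, so the result is an estimate with 32-bit granularity.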


Reporting the Information


Now that we have all of the statistical information, let's address the issue of how we should report it to the user. The implementation that is included on the CD records all information to a log file. Once the user has enabled the memory manager and run the program, upon termination a log file is generated containing a listing of all the memory leaks, bounds violations, and the final statistical report.

The only question remaining is: how do we know when the program is terminating so that we can dump our log information? A simple solution would be to require the programmer to explicitly call the dumpLogReport() routine upon termination. However, this goes against the requirement of creating a seamless interface. In order to determine when the program has terminated without the use of an explicit function call, we are going to use a static class instance. The implementation is as follows:

class Initialize {
public:
  Initialize() { InitializeMemoryManager(); }
};
static Initialize InitMemoryManager;

bool InitializeMemoryManager() {
  static bool hasBeenInitialized = false;
  if (s_manager) return true;
  else if (hasBeenInitialized) return false;
  else {
    s_manager = (MemoryManager*)malloc(sizeof(MemoryManager));
    s_manager->initialize();
    atexit( releaseMemoryManager );
    hasBeenInitialized = true;
    return true;
  }
}

void releaseMemoryManager() {
  NumAllocations = s_manager->m_numAllocations;
  s_manager->release();  // Releases the hash table and calls
  free( s_manager );     // the dumpLogReport() method
  s_manager = NULL;
}

The problem before us is to ensure that the memory manager is the first object to be created and the very last object to be deallocated. This can be difficult due to the order in which objects that are statically defined are handled. For example, if we created a static object that allocated dynamic memory within its constructor, before the memory manager object is allocated, the memory manager will not be available for memory tracking. Likewise, if we use the ::atexit() method to call a function that is responsible for releasing allocated memory, the memory manager object will be released before the ::atexit() method is called, thus resulting in bogus memory leaks. In order to resolve these problems, the following enhancements need to be added. First, by creating the InitMemoryManager object within the header file of the memory manager, it is guaranteed to be encountered before any static objects are declared.


This holds true as long as we #include the memory manager header before any static definitions. Microsoft states that static objects are allocated in the order in which they are encountered, and are deallocated in the reverse order [MSDN]. Second, to ensure that the memory manager is always available, we are going to call the InitializeMemoryManager() routine every time within the AllocateMemory() and deAllocateMemory() routines, guaranteeing that the memory manager is active. Finally, in order to ensure that the memory manager is the last object to be deallocated, we will use the ::atexit() method. The ::atexit() method works by calling the specified functions in the reverse order in which they are passed to the method [MSDN1]. Thus, the only restriction that must be placed on the memory manager is that it is the first method to call the ::atexit() function. Static objects can still use the ::atexit() method; they just need to make sure that the memory manager is present. If, for any reason, the InitializeMemoryManager() function returns false, then this last condition has not been met and, as a result, the error will be reported in the log file.

Given the previous restriction, there are a few things to be aware of when using Microsoft's Visual C++. The ::atexit() method is used extensively by internal VC++ procedures in order to clean up on shutdown. For example, the following code will cause an ::atexit() to be called, although we would have to check the disassembly to see it.
void Foo() { static std::string s; }

While this is not a problem if the memory manager is active before the declaration of s is encountered, it is worth noting. Although this example is specific to VC++, other compilers might behave differently or contain additional methods that call ::atexit() behind the scenes. The key to the solution is to ensure that the memory manager is initialized first.

Things to Keep in Mind


Besides the additional memory and time required to perform memory tracking, there are a few other details to keep in mind. The first has to deal with syntax errors that can be encountered when #including other files. In certain situations, it is possible to generate syntax errors due to other files redefining the new and delete operators. This is especially noticeable when using STL implementations. For example, if we #include "MemoryManager.h" and then #include <map>, we will generate all types of errors. To resolve this issue, we are going to be using two additional header files: new_on.h and new_off.h. These headers will simply #define and #undef the new/delete macros that were created earlier. The advantage of this method is the flexibility that we achieve by not forcing the user to abide by a particular #include order, and it avoids the complexity of dealing with precompiled headers.

#include "new_off.h"
#include <map>
#include <string>
#include <All other headers overloading the new/delete operators>
#include "new_on.h"
#include "MemoryManager.h" // Contains the Memory Manager Module
#include "Custom header files"

Another issue we need to address is how to handle libraries that redefine the new and delete operators on their own. For example, MFC has its own system in place for handling the new and delete operators [MSDN2]. Thus, we would like to have MFC classes use their own memory manager, and have non-MFC shared game code use our memory manager. We can achieve this by inserting the #include "new_off.h" header file right after the #ifdef created by the ClassWizard.

#ifdef _DEBUG
#include "new_off.h" // Turn off our memory manager
#define new DEBUG_NEW
#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#endif

This method will allow us to keep the advantages of MFC's memory manager, such as dumping CObject-derived classes on memory leaks, and still provide the rest of the code with a memory manager. Finally, keep in mind the requirements for properly implementing the setOwner() method used by the delete operator. It is necessary to realize that the implementation is more complicated than just recording the file and line number; we must create a stack implementation. This is a result of the way that we implemented the delete macro. Take, for example, the following:

File 1: line 1: class B { B() {a = new int;} ~B() {delete a;} };
File 2: line 1: B *objectB = new B;
File 2: line 2: delete objectB;

The order of function calls is as follows:


1. new( objectB, File2, 1 );
2. new( a, File1, 1 );
3. setOwner( File2, 2 );
4. setOwner( File1, 1 );
5. delete( a );
6. delete( objectB );

As should be evident from the preceding listing, by the time the delete operator is called to deallocate objectB, we will no longer have the file and line number information unless we use a stack implementation. While the solution is straightforward, the problem is not immediately obvious.
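The stack requirement can be sketched as follows. This is a hedged illustration only: the Owner struct, the fixed depth of 64, and the popOwner helper are assumptions made for the example, and the real setOwner() on the CD may differ. A fixed-size array is used so that the bookkeeping does not itself allocate memory.

```cpp
#include <cassert>
#include <cstddef>

// One pending delete: where the delete expression appeared.
struct Owner {
    const char* file;
    int line;
};

const std::size_t kMaxOwnerDepth = 64;   // assumed nesting limit for this sketch
static Owner g_ownerStack[kMaxOwnerDepth];
static std::size_t g_ownerDepth = 0;

// Called by the delete macro just before the delete expression runs.
void setOwner(const char* file, int line) {
    assert(g_ownerDepth < kMaxOwnerDepth);
    g_ownerStack[g_ownerDepth].file = file;
    g_ownerStack[g_ownerDepth].line = line;
    ++g_ownerDepth;
}

// Called by the deallocation routine: the most recent owner belongs to the
// innermost delete, because destructors run before the outer block is freed.
Owner popOwner() {
    assert(g_ownerDepth > 0);
    --g_ownerDepth;
    return g_ownerStack[g_ownerDepth];
}
```

Replaying the call order listed above, the two setOwner() calls push File2:2 and then File1:1; the deallocation of a pops File1:1, and the deallocation of objectB pops File2:2, so each block is matched with the line that deleted it.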


Further Enhancements
Within the implementation provided on the CD accompanying this book, there are several enhancements to the implementation discussed here. For example, there is the option for the user to set flags to perform more comprehensive memory tests. Options also exist for setting breakpoints when memory is deallocated or reallocated so that the program's stack can be interrogated. These are but a few of the possibilities that are available. Other enhancements could easily be included, such as allowing a program to check if any given address is valid. When it comes to memory control, the options are unlimited.

References
[Dalton01] Dalton, Peter, "Inline Functions versus Macros," Game Programming Gems II, Charles River Media. 2001.
[McConnell93] McConnell, Steve, Code Complete, Microsoft Press. 1993.
[MSDN1] Microsoft Developer Network Library, http://msdn.microsoft.com/library/devprods/vs6/visualc/vclang/_pluslang_initializing_static_objects.htm
[MSDN2] Microsoft Developer Network Library, http://msdn.microsoft.com/library/devprods/vs6/visualc/vccore/core_memory_management_with_mfc.3a_.overview.htm
[Myers98] Meyers, Scott, Effective C++, Second Edition, Addison-Wesley Longman, Inc. 1998.

1.11
A Built-in Game Profiling Module
Jeff Evertt, Lithtech, Inc.
[email protected]

This gem describes the architecture and implementation of a profiling module for low-overhead, real-time analysis that supports performance counter organization so that many consumers can work together in harmony. It is designed from a game engine perspective, with many of its requirements specifically pertaining to things typically found in games. At the time of this writing, the described module is in use by a commercially available game engine.

Profiling the performance of a game or engine is one of those things that everyone agrees is important, but just as often as not, guesswork or quick hacks are substituted for a real game system that can gather solid data. In the long run, the time it takes to implement a clean profiling system is a wise investment. And, as with everything else, the earlier we plan for it, the easier it will be.

Profiling Basics
The basic profiling mechanism is simple: take a timestamp at the beginning of the code of interest and again at the end. Subtract the first from the second, and voila, that's how long the code took to run. We need a high-resolution counter - the Windows multimedia timer and its millisecond resolution will not cut it. If the platform is Windows on a PC, there are two high-resolution API calls we can use: QueryPerformanceCounter and QueryPerformanceFrequency. However, because the overhead of these functions is fairly high, we will roll our own, which only requires a few lines of inline assembly:
void CWin32PerfCounterMgr::GetPerfCounter( LARGE_INTEGER &iCounter) {
  DWORD dwLow, dwHigh;
  __asm {
    rdtsc
    mov dwLow, eax
    mov dwHigh, edx
  }
  iCounter.QuadPart = ((unsigned __int64)dwHigh << 32) | (unsigned __int64)dwLow;
}


To convert this number into seconds, we need to know the counter frequency. In this case it is equal to the CPU cycles per second. We can measure it once when the counters are enabled: take a time sample, sleep for at least 500 ms, and then take another sample. The difference between the two samples divided by the elapsed time gives the frequency. Note that similar counters are available if the target platform is a game console.
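The calibration step can be sketched in portable C++ as follows. This is an illustration under stated assumptions: readTicks is a hypothetical stand-in that returns nanoseconds from std::chrono::steady_clock so the example runs anywhere, whereas the module described here would read the CPU cycle counter instead. The names measureTicksPerSecond and ticksToSeconds are also invented for the example.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in tick source for this sketch: nanoseconds from steady_clock.
// On a PC this would be replaced by a routine that reads the cycle counter.
std::uint64_t readTicks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}

// Estimates the counter frequency by sampling the counter before and after
// a sleep, and dividing the tick difference by the measured elapsed time.
double measureTicksPerSecond(std::uint64_t (*tickSource)(),
                             std::chrono::milliseconds sleepTime) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point wallStart = Clock::now();
    const std::uint64_t tickStart = tickSource();
    std::this_thread::sleep_for(sleepTime);
    const std::uint64_t tickEnd = tickSource();
    const std::chrono::duration<double> elapsed = Clock::now() - wallStart;
    return static_cast<double>(tickEnd - tickStart) / elapsed.count();
}

// Converts a tick difference into seconds using the measured frequency.
double ticksToSeconds(std::uint64_t ticks, double ticksPerSecond) {
    return static_cast<double>(ticks) / ticksPerSecond;
}
```

Measuring the elapsed time explicitly, instead of trusting the requested sleep duration, matters because a sleep call only guarantees a minimum delay. A call such as measureTicksPerSecond(readTicks, std::chrono::milliseconds(500)) would follow the 500 ms guideline given above.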

Commercially Available Tools


Performance tuning is definitely a case where choosing the right tool for the job can make all the difference. There are many time-tested commercial tools available for the PC that sample an application as it runs, then offline allow profile data to be viewed module-by-module, function-by-function, and just about any other imaginable way. Intel VTune and Metrowerks Analysis Tools both make use of the built-in CPU hardware counters to generate post-processed profiles of runtime sections of a game. Tuning assembly code by instruction ordering or pairing prediction is definitely a strength of VTune. The Intel Graphics Performance Toolkit (GPT) provides some powerful scene analysis tools. It hooks in and snoops traffic at the layer between your application and Direct3D/OpenGL. Knowing exactly what is being drawn can at times be very helpful. Changing the order or the way in which the game renders can sometimes significantly affect performance. However, the GPT is written to a specific version of DirectX, so its releases usually trail that of DirectX. Also, taking any significant scene data will slow down the application, so relying on the performance characteristics of data taken when using the GPT can be dangerous. Statistics-gathering drivers for graphics cards and hardware counters can be invaluable. Nvidia releases special drivers and a real-time data viewing application that hooks all of the function entry points of the drivers. If the graphics driver is taking a significant percentage of CPU time, this application will allow us to look inside and break it down further. Intel provides counters in its drivers and hardware for its i740 chip, allowing optimization for stalls all the way down to the graphics chip level. Some of the game consoles also provide this ability. It can be very useful, as it is the only way to break down performance at this low level. 
It does, however, require a fair amount of knowledge about how the drivers and chips operate, and what the counters really mean.

Why Roll Our Own?


Reason one: frame-based analysis. Games typically have a fairly high frame-to-frame coherency, but in just a matter of seconds can drastically change. Imagine a 3D shooter: a player starts facing a wall, runs down a long corridor, then ends it all in a bloody firefight with five AI-driven enemies. The game engine is running through many potentially different bottlenecks that can only really be identified with a frame-by-frame analysis. Looking at a breakdown of an accumulated sample over the entire interval gives an inaccurate view of what is really going on. Frame-based analysis allows focusing on one problem at a time.

Reason two: it can be done anytime and anywhere. At the end of a PC game development cycle, someone will probably be faced with performance problems that only manifest themselves on someone's brother's machine, on odd Tuesdays. There are typically a significant number of these types of problems. They can cost a lot of time and can very easily slip the release date. Although this type of problem is unique to PC games, console games still have to deal with the "shooting a missile in the corner of level three grinds the game to a slow crawl" types of problems. Once the problem is understood, figuring out the solution is usually the easy part. If we could walk over to that test machine and pop up a few counter groups, we would quickly nail down the culprit.

Reason three: customizability. Modern game engines are complicated. The ability to ignore all the other modules in the engine except for the one being worked on is powerful. In addition, the only person that can organize the data exactly how they want it is the engineer actually doing the work.

Profile Module Requirements


Requirement one: allow users to quickly and accurately profile the application.

Requirement two: be non-obtrusive (that is, have very low overhead). When the cost for taking samples and displaying the results becomes a significant portion of the frame time, it can actually change the application's behavior within the system. In general, slowing down the CPU will tend to hide stalls caused by graphics cards. While even a very small percentage can in some rare cases drastically change game performance, as a general rule, when the profiler is enabled, it should take less than five percent of the total CPU cycles. When disabled, it should be much less than one percent.

Requirement three: allow multiple users to work independently on their respective systems without having to worry about other engine modules.

Requirement four: when it's not needed, it should be well out of the way.

Architecture and Implementation


A performance counter manager (IPerfCounterMan) keeps track of all active and inactive counters. The counters are organized into groups of similar type (for example, model render, world render, AI, physics) that are enabled and disabled together. This supports the notion of multiple groups working independently in an easy-to-understand grouping concept. Groups are useful for two reasons: for quickly determining if a counter needs to be sampled, and for enabling and disabling groups of counters to be displayed. We will make use of four-character codes (FourCCs) for the group ID and full text strings for counter names.

The entire system is organized into a module with an interface to the rest of the system. The basic component is a counter that is identified by a group ID (its FourCC) and its string name. Each counter is given an integer ID on creation that uniquely identifies it. In typical usage, the game code creates counters on initialization and puts start/stop counter calls around the code to be profiled. The basic functional unit interface for the module is as follows:

class IPerfCounterMan {
public:
  // Add new counter (returns the ID, 0 is failure)
  int32 AddCounter(uint32 CounterGroup, const char* szCounterName);
  // Forget your counter's ID? (Zero is failure)
  int32 GetCounterID(uint32 CounterGroup, const char* szCounterName);
  // Delete the counter
  bool DeleteCounter(uint32 CounterID);
  // Start and Stop a counter.
  void StartCounter(uint32 CounterID);
  void StopCounter(uint32 CounterID);
  // Draw the Counters onto the Screen (to be called once
  // per frame near the end of the scene)
  void DrawCounters();
};
StopCounter calculates the difference between the StartCounter and StopCounter calls and keeps a running total. On DrawCounters, all the running counters are cleared. A maximum value is also maintained and is set at the end of the frame in DrawCounters. Let's assume that our engine has a debug console that accepts text commands. It is a very convenient way to enable and disable counter groups and to allow customization of the display. It is very helpful to allow as much configuration in the counter display as possible. We will most likely not want to refresh the counter display every frame (updates every 30 frames should be sufficient), but depending on what is being debugged, the ability to customize the refresh time can be very handy. In addition, displaying both the current percentage and the maximum percentage since last displayed is useful. A bar graph is a good way to display the result. It gives the consumer a quick feel for the numbers and isn't hard to code. The ability to switch from percentage to actual time (in milliseconds), display the time or percentage as text values, and auto-scale the axes is also very useful. Be careful about switching the axis scale very often, especially without some kind of warning, because it will likely just confuse people.
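The running-total and per-frame maximum bookkeeping described at the start of this passage can be sketched as follows. CounterState and its members are hypothetical names, and timestamps are passed in explicitly so the logic can be followed without a real timer; the actual module's internals are not shown in this gem.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-counter bookkeeping (one instance per counter ID).
struct CounterState {
    std::uint64_t startTicks = 0;     // timestamp taken by the last start
    std::uint64_t frameTotal = 0;     // running total for the current frame
    std::uint64_t maxFrameTotal = 0;  // largest frame total seen so far

    // Equivalent of StartCounter: remember when the timed section began.
    void start(std::uint64_t now) { startTicks = now; }

    // Equivalent of StopCounter: add the elapsed ticks to the running total.
    void stop(std::uint64_t now) { frameTotal += now - startTicks; }

    // Called once per frame from the draw routine: returns this frame's
    // total, updates the maximum, and clears the running total.
    std::uint64_t endFrame() {
        const std::uint64_t total = frameTotal;
        maxFrameTotal = std::max(maxFrameTotal, total);
        frameTotal = 0;
        return total;
    }
};
```

Because stop() accumulates instead of overwriting, a counter that is started and stopped several times in one frame reports the sum of all its timed sections.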


Implementation Details
The interface to the performance counter manager should be flexible and easy to use. Consumers of the profile manager will often find it easier to simply call AddCounter(...) with the full string, get the ID, and start it up all at once instead of saving the counter ID at some one-time initialization point. Providing this mechanism can help out when doing some quick profiling. However, it's not as efficient, and calling it many times in a frame will add up quickly. Also, supplying a class that can be placed at the beginning of a function, and that calls StartCounter in its constructor and StopCounter in its destructor (when it goes out of scope), can be a handy way to instrument the counters.

When writing the profiling manager, it's best to provide some kind of #define macro that completely removes the profiler. When it comes down to getting peak performance out of a game, profiling code is often one of the first things to go. We need to provide macros for AddCounter, StartCounter, and StopCounter that completely compile out on an #ifdef change. Also, it's best to use colors for visual cues. When the counters are being displayed, it's easier to read if we use different colors on each line.
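A scoped counter class and a compile-out macro might look like the following sketch. Everything here is an assumption made for illustration: the free functions StartCounter and StopCounter are stubs standing in for calls on the manager, and the names ScopedCounter, ENABLE_PROFILER, and PROFILE_SCOPE are hypothetical.

```cpp
#include <cstdint>

// Stubs standing in for the manager's StartCounter/StopCounter so that the
// sketch is self-contained; they only count how often they are called.
static int g_startCalls = 0;
static int g_stopCalls = 0;
void StartCounter(std::uint32_t /*counterID*/) { ++g_startCalls; }
void StopCounter(std::uint32_t /*counterID*/) { ++g_stopCalls; }

// Starts the counter on construction and stops it when the object goes out
// of scope, so every exit path of the enclosing function is covered.
class ScopedCounter {
public:
    explicit ScopedCounter(std::uint32_t counterID) : m_counterID(counterID) {
        StartCounter(m_counterID);
    }
    ~ScopedCounter() { StopCounter(m_counterID); }
    ScopedCounter(const ScopedCounter&) = delete;
    ScopedCounter& operator=(const ScopedCounter&) = delete;

private:
    std::uint32_t m_counterID;
};

#define ENABLE_PROFILER 1  // assumed build switch; leave undefined to compile out

#ifdef ENABLE_PROFILER
#define PROFILE_SCOPE(id) ScopedCounter scopedCounter_(id)
#else
#define PROFILE_SCOPE(id) ((void)0)
#endif

int updatePhysics() {
    PROFILE_SCOPE(1);  // hypothetical counter ID
    return 7;          // the counter stops automatically on return
}
```

With ENABLE_PROFILER undefined, PROFILE_SCOPE expands to an empty expression and no profiling code remains in the function. Note that this simple macro uses a fixed variable name, so it supports only one profiled scope per block.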

Data Analysis
Be sure to profile the release build, because it can have a very different set of bottlenecks from the debug version. If the target platform is the PC, it is also a good idea to pick two or three typical system configurations (low to high end) and profile each of them. Bottlenecks can vary greatly across system configurations. The game should be profiled in the areas that have performance problems as well as during typical game play. We must break the problem down, try to focus on one thing at a time, and focus on the areas that will give the biggest bang for the buck. Just because a function is called the most often or takes the most CPU time doesn't mean it is the only place we should focus our efforts. Often, the only thing we can compare our cycle times with is our expectations, and realistic expectations are usually gained only through experience. The profiler itself should also be profiled. If the act of profiling is intrusive, it changes the behavior of your game. There should be a counter around the profiler's draw routines.

Implementation Notes
The described module has been implemented across multiple platforms. However, parts of it require platform-dependent functions. The actual timestamp query and the draw functions will most likely need to be implemented in platform-dependent code, so it's best to design a level of abstraction around those functions. The described implementation uses a set of debug geometry and text (which has a platform-dependent implementation) in the draw code so that it can be platform independent. You may need to write a macro to create your four-character code values, as many compilers do not have support for them. This same system can be used to take long-running profiles of a game server to detect problems. All the counters go through one source, so data can easily be filtered down and saved to disk.
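One possible way to build four-character codes is sketched below. The names makeFourCC, MAKE_FOURCC, and kGroupAI are hypothetical, and the byte order (first character in the lowest byte) is an assumption; any order works as long as it is used consistently. A constexpr function does the packing, with a thin macro on top for code that expects macro syntax.

```cpp
#include <cstdint>

// Packs four characters into one 32-bit value, first character in the
// lowest byte. The byte order is an assumption made for this sketch.
constexpr std::uint32_t makeFourCC(char a, char b, char c, char d) {
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(a))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(d)) << 24);
}

// Macro form, for call sites that prefer the traditional spelling.
#define MAKE_FOURCC(a, b, c, d) makeFourCC((a), (b), (c), (d))

// Example group ID for an AI counter group (name chosen for illustration).
const std::uint32_t kGroupAI = MAKE_FOURCC('A', 'I', ' ', ' ');
```

Because the function is constexpr, group IDs built this way can be used as compile-time constants, for example as case labels in a switch over counter groups.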

1.12
Linear Programming Model for Windows-based Games
Javier F. Otaegui, Sabarasa Entertainment
[email protected]

In the past, when DOS ruled the earth, we programmed our games in a mostly linear fashion. Then it was time to port our creations from DOS to DirectX, and this was a big jump because of the Windows message pump. Its architecture is simply not adequate for game programming. In this gem, we will cover an effective way to encapsulate the message pump, provide a linear programming model and, as a very desirable side effect, allow correct "alt-tab" application switching. We will also cover correct recovery of lost surfaces.

If you have previously programmed linearly, you will easily understand the importance of the method introduced in this gem. If your experience in game programming started with Windows, then you might find the message pump a natural environment for game programming, but once you try linear programming, you will never go back to the message pump. It is far clearer and easier to follow and debug than a huge finite state machine is. You can save a lot of design, programming, debugging time, and thinking if you start working in a more linear way.

Updating the World


Modern games often have some sort of UpdateWorld function, located in the heart of the application in the message pump, and invoked whenever it is not receiving any messages. In a first attempt, coding an UpdateWorld function can be very simple: all the application variables, surfaces, and interfaces have already been initialized, and now we just have to update and render them. That should be an easy task, but only if we plan that our game will have only one screen, no cut-scenes, no menus, and no options.

The problem is that UpdateWorld must eventually finish and return to the message pump so we can process messages from the system. This prevents us from staying in a continuous for loop, for example. As old DOS games didn't have to return constantly to a message pump to process system requests, we could linearly program them, and our subroutines could have all the loops they needed, or delays, or cut-scenes. We simply had to insert the corresponding code into the subroutine. Now, however, with the message pump, which requires constant attention, we must return on every loop.

As stated previously, the problem of returning in every single loop appears when attempting to maintain several game screens. The way to work around this is to make every subroutine of the application a finite state machine. Each subroutine will have to keep track of its internal state, and, according to this state, it must invoke several other subroutines. Each of these other subroutines is also a finite state machine, and when it finishes its execution (that is, it has no more states to execute), it must return a value to inform the invoking subroutine that it can proceed with its own following state. Of course, each subroutine, when it finishes, must reset its state to 0, to allow the application to invoke it again. Now if we imagine 30 or 40 of these subroutines, each with a couple dozen states, we will be facing a very big monster. Trying to debug or even follow this code will be difficult. This finite-state programming model is far more complicated than the simple model achieved by old linear DOS programs.
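To make the cost of this model concrete, here is a small sketch of a single subroutine written as a finite state machine. The playIntro and runUntilFinished functions are hypothetical examples invented for illustration; the three-frame "intro" simply stands in for any sequence that would be a plain loop in a linear program.

```cpp
// Hypothetical intro sequence written as a finite state machine. Each call
// does one small step and returns to the caller (the message pump).
// Returns true once it has no more states to run, after resetting itself.
bool playIntro(int& framesShown) {
    static int state = 0;
    switch (state) {
    case 0:                        // set up the sequence
        framesShown = 0;
        state = 1;
        return false;
    case 1:                        // show one frame per call
        if (++framesShown >= 3) state = 2;
        return false;
    default:                       // finished: reset so it can run again
        state = 0;
        return true;
    }
}

// Stand-in for the message pump: keeps calling the subroutine until it
// reports completion, and returns how many calls that took.
int runUntilFinished() {
    int framesShown = 0;
    int calls = 0;
    do {
        ++calls;
    } while (!playIntro(framesShown));
    return calls;
}
```

Even this trivial sequence needs a persistent state variable, an explicit reset, and five separate calls to complete, where a linear version would be a single three-iteration loop.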

The Solution: Multithreading


Here is a simple multithreading model that frees the game programmer from the message pump and its potentially undesirable finite-state programming model. Windows supports multithreading, which means that our application can run several threads of execution simultaneously. The idea is very simple: put the message pump in one thread and the game into another one. The message pump will remain in the initial thread, so we can take the UpdateWorld function out of the message pump and return it to its simplest form (a linear programming scheme). Now we just need to add to the doInit function the code necessary to initiate the game thread.

HANDLE hMainThread;   // Main Thread handle

static BOOL doInit( ... )
{
    ... // Initialize DirectX and everything else

    DWORD tid;
    hMainThread = CreateThread( 0, 0, &MainThread, 0, 0, &tid );
    return TRUE;
}
MainThread is defined by:


DWORD WINAPI MainThread( LPVOID arg1 )
{
    RunGame();
    PostMessage( hwnd, WM_CLOSE, 0, 0 );
    return 0;
}

MainThread will invoke our RunGame function, and when it is finished, we just post a WM_CLOSE message to tell the message pump thread to finish execution.

Initialization Code

Now we must choose whether to include initialization code (including the DirectX initialization code) in the doInit function or directly in our RunGame function. It may be more elegant to include it in the doInit function, as long as we include all terminating code in the response to WM_CLOSE in our message handler. On the other hand, we could include all the initialization code in the RunGame function, which means that we will handle all of the important parts of the code directly in our new linear game-programming function.
The "Alt-Tab" Problem

Making a game truly multitasking under Windows is perhaps one of the most hazardous issues in game programming. A well-behaved application must be able to correctly switch to other applications. This means allowing the user to alt-tab away from the application, which some games attempt to disallow, but we will try to make things work correctly. We could try using the standard SuspendThread and ResumeThread functions, but it's nearly impossible to get this to work properly. Instead, we will use a multithreaded communication tool: events. Events work like flags that can be used to synchronize different threads. Our game thread will check if it must continue, or if it must wait for the event to be set. On startup, we must create a manual-reset event. This event should be reset (cleared) when the program is deactivated, and set when the program is reactivated. Then, in the main loop, we just have to wait for the event to be set. To create the event, we need this global:
HANDLE task_wakeup_event;

To create and set the event, we need to include the following code during initialization:
task_wakeup_event = CreateEvent(
    NULL,   // No security attributes
    TRUE,   // Manual Reset ON
    FALSE,  // Initial state = Non signaled
    NULL    // No name
);


Most games have a function in their main loop that is called every time the game needs to render a new screen; this is typically where DirectX is asked to flip the primary and back buffers. Because this function is called constantly, this is an ideal location to make the thread wait for the event in an idle state, using this code:
WaitForSingleObject( task_wakeup_event, INFINITE );

We must suspend the thread every time the operating system switches the active application. To do this, our WindowProc function must check, upon receiving a WM_ACTIVATEAPP message, whether the application is being activated or deactivated. If the application has gone to an inactive state, we must suspend the game execution, which requires this call:
ResetEvent( task_wakeup_event );

and to resume it:


SetEvent( task_wakeup_event );

With this simple implementation, when the user hits alt-tab, the game will temporarily stop execution, freeing all the processor's time to enable the user to do other tasks. If the world update must continue executing even if the application loses focus, then we can just suspend the rendering pipeline, and continue updating the world. This event model can be used with any number of threads that the application may require, by inserting new events for each new thread.
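The set, reset, and wait behavior described above can also be sketched in portable C++. This is not the Win32 API used in the text; it is a stand-in built on the standard library, and the class name ManualResetEvent and its member names are assumptions made for this illustration.

```cpp
#include <condition_variable>
#include <mutex>

// Portable stand-in for a Win32 manual-reset event (illustrative only).
class ManualResetEvent {
public:
    explicit ManualResetEvent(bool initiallySet) : set_(initiallySet) {}

    // Like SetEvent: mark the event signaled and wake every waiter.
    void Set()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            set_ = true;
        }
        cv_.notify_all();
    }

    // Like ResetEvent: later calls to Wait() block until Set() is called.
    void Reset()
    {
        std::lock_guard<std::mutex> lock(m_);
        set_ = false;
    }

    // Like WaitForSingleObject(event, INFINITE): returns at once if the
    // event is signaled, otherwise sleeps without consuming CPU time.
    void Wait()
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return set_; });
    }

    bool IsSet()
    {
        std::lock_guard<std::mutex> lock(m_);
        return set_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool set_;
};
```

Because the event is manual-reset, a Wait() in the render function costs almost nothing while the game is active; it only blocks after the window procedure has called Reset().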
Handling Lost Surfaces

If we use video memory surfaces, we will face the problem of losing the surface information when the application loses focus. The problem that we now face is that with our new linear programming model, a program can be caught in the middle of a subroutine with all its surfaces lost. There are many possible solutions to this situation, one of which is the Command pattern [GoF94]. Unfortunately, it obscures our code, and the main goal of this gem is to make things clearer. We can use a global stack of pairs of callback functions and lpVoids, which will be called when the surfaces need to be reloaded. When we need to restore the surfaces, we would invoke callback_function( lpVoid ). The lpVoid parameter can include pointers to all the surfaces that we need, so we can keep the surfaces local to our new linear subroutines. Let's suppose that we have a subroutine called Splash that displays a splash screen in our game, which is a surface loaded from a file. If the user hits alt-tab while the splash screen is displayed and then comes back, we want our application to show the splash screen again (let's assume that the surface was lost while the application was inactive). Using our proposed method, we must do something like this:
int LoadSplashGraphics( LPVOID Params )
{
    Surface *pMySurface;
    pMySurface = (Surface *) Params;

    // (load the graphic from the file)

    return 1;
}

int Splash()
{
    Surface MySurface;

    // Push the function
    gReloadSurfacesStack.Push( &LoadSplashGraphics, &MySurface );

    // Do not forget to load graphics for the first time
    LoadSplashGraphics( &MySurface );

    // ... the subroutine functionality.

    // Pop the function
    gReloadSurfacesStack.Pop();

    return 1;
}

We are using a stack so that each nested subroutine can add all the surface loading and generation code that it might need. The implementation could easily be changed to another collection class, but this is a classic stack-oriented problem due to its nested functionality, and so a stack works best here.
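The global stack itself is not shown in the text, so the following is only a sketch of how it might look. The class name ReloadSurfacesStack, the ReloadAll and Depth members, and the stand-in Surface type are assumptions made for illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef int (*ReloadFn)(void* params);

// Minimal stand-in for the global stack of (callback, pointer) pairs.
class ReloadSurfacesStack {
public:
    void Push(ReloadFn fn, void* params)
    {
        entries_.push_back(std::make_pair(fn, params));
    }
    void Pop() { entries_.pop_back(); }

    // Call every registered loader, outermost subroutine first.
    // Returns how many loaders reported success (a non-zero result).
    int ReloadAll() const
    {
        int ok = 0;
        for (std::size_t i = 0; i < entries_.size(); ++i)
            if (entries_[i].first(entries_[i].second))
                ++ok;
        return ok;
    }
    std::size_t Depth() const { return entries_.size(); }

private:
    std::vector<std::pair<ReloadFn, void*> > entries_;
};

// Hypothetical surface type and loader, used only in this sketch.
struct Surface { int timesLoaded = 0; };

int LoadSplashGraphics(void* params)
{
    Surface* s = static_cast<Surface*>(params);
    ++s->timesLoaded;               // (load the graphic from the file)
    return 1;
}
```

When the application regains focus, the code that restores lost surfaces would call ReloadAll() once, and every subroutine currently on the call chain gets its surfaces back.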

References
[GoF94] Gamma, E., Helm, R., Johnson, R., & Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1994.
Otaegui, Javier F., "Getting Rid of the Windows Message Pump," available online at www.gamedev.net/reference/articles/article1249.asp.

1.13
Stack Winding
Bryon Hapgood, Kodiak Interactive
[email protected]

Stack winding is a powerful technique for assembly programmers that allows us to modify an application's stack to do weird and wonderful things that can be extended into C/C++ with very little work. While the days of writing every line of game code in hand-optimized machine language are over, sometimes it is worth the effort to dip into the realm of the arcane to get that extra bit of speed and elegance in a game. In this gem, we cover one particular type of stack winding that I call the "temporary return." This is the bare minimum form that we will build upon in subsequent examples until we have a thunked temporary return. The code examples have been tested with Microsoft's MASM and Visual C++ compiler. I have personally used stack winding in a number of projects for the GameBoy Color, PC, and Xbox.

Simple TempRet
Stack winding, as its name implies, is a technique for modifying the stack to make it do unexpected things. The term stack winding comes from the idea of inserting values in an existing stack frame to change its normal and expected behavior.

Listing 1.13.1 The TempRet routine


 0  .586
 1
 2  .model flat
 3  .data
 4  buffer       dd ?
 5  file_handle  dd ?
 6  filesize     dd ?
 7  .code
 8  _TempRetEg:
 9      call fn0
10      call fn1
11
12      ;
13      ; before
14      ;
15      pop edx
16      call edx
17      ;
18      ; after
19      ;
20      call fn2
21      call fn3
22      ret
23  A:
24      call _TempRetEg
25      ret
26
27  end
In Listing 1.13.1, we see the first building block of stack winding: the TempRet routine. Let's take a function (call it MyFunc) and say it calls _TempRetEg. The latter then calls two functions: fn0 and fn1. It then hits the lines:

    pop edx
    call edx


Now we know that the way the CPU handles the assembly CALL instruction on line 24 is to push the address of the next line (25) and execute a JUMP to line 8. Line 15 pops that address off the stack and stores it in a CPU register. Now we CALL that address. This pushes line 20 onto the stack and executes a JUMP to line 25. The latter does nothing but execute a CPU return, which pops an address off the stack and jumps there. The rest of _TempRetEg then continues and when it returns, we do not return to MyFunc but to whatever function called MyFunc in the first place. It is an interesting little trick, but why would it be important? The power comes when we consider the functions FN0 through FN3. Let's say that FN0 opens a file, FN1 allocates a buffer and reads the file into memory, FN2 frees that memory, and FN3 closes the file. Thus, MyFunc no longer has to worry about the release steps. It doesn't have to close the file or worry about freeing up the memory associated with the file. Functionally, the process of opening a file, reading it into memory, freeing that memory, and closing the file is all contained within a single block of code. MyFunc only has to call _TempRetEg, use the buffer, and return.

TempRet Chains
The TempRet example comes of age when we chain functions together. Let's take a classic problem: the initialization and destruction of DirectX 7. This usually takes a number of steps, but it's incredibly important to release the components of DX in reverse order, which can sometimes become horribly complicated. So, let's expand our first example to illustrate this:


Listing 1.13.2 Winding multiple routines onto the stack


.586
.model flat
.code

TempRet macro
    pop edx
    call edx
TempRet endm

createWindow:
    ; open the window
    TempRet
    ; close it
    ret

setCooperativeLevel:
    ; set to exclusive
    TempRet
    ; restore
    ret

changeDisplayMode:
    ; set 640x480 16 bpp
    TempRet
    ; restore
    ret

createSurfaces:
    ; create primary surface
    ; get attached back
    TempRet
    ; release primary
    ret

_SetupDX7:
    call createWindow
    call setCooperativeLevel
    call changeDisplayMode
    call createSurfaces
    jmp _SomeUserRunFunc

end

By performing numerous TempRets in succession we effectively have wound four routines onto the stack so that when _SomeUserRunFunc returns, we will bounce back through createSurfaces, changeDisplayMode, setCooperativeLevel, and createWindow at the line after the TempRet in reverse order. So far, we've been using assembly language, but it's not necessary to write assembly modules to use this technique. We will cover two mechanisms in Microsoft's Visual C++ in the final section that aid us in stack winding: inline assembly and naked functions.
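For readers who want to see the resulting order of operations without reading assembly, the sketch below reproduces it in standard C++. Note that this is not stack winding: it swaps in ordinary constructors and destructors (RAII) to obtain the same "set up in order, tear down in reverse" behavior. The Step type, the g_log vector, and the step names are hypothetical.

```cpp
#include <string>
#include <vector>

// Records the order of setup and teardown steps so the effect is visible.
static std::vector<std::string> g_log;

// Each guard "opens" in its constructor and "closes" in its destructor,
// playing the role of the code before and after a TempRet.
struct Step {
    explicit Step(const char* name) : name_(name)
    {
        g_log.push_back("open " + name_);
    }
    ~Step() { g_log.push_back("close " + name_); }
    std::string name_;
};

void SomeUserRunFunc() { g_log.push_back("run"); }

void SetupAndRun()
{
    Step window("window");
    Step coop("cooperative level");
    Step mode("display mode");
    Step surfaces("surfaces");
    SomeUserRunFunc();
}   // destructors run here, in reverse order of construction
```

The assembly version achieves the same ordering by leaving four return addresses on the stack, which is why it works even in languages or contexts that have no destructors.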


Thunking
The ideas discussed so far need to be translated into C/C++. As stated previously, Visual C++ has a handy mechanism for doing this, but what about other compilers? If naked functions are not supported, then we will have to dip into assembly language because the presence of a stack frame really complicates things. It is not impossible, just difficult. Thunking is a technique popularized by Microsoft for slipping a piece of code between two others. In effect, program flow is hurtling along through our code until, thunk! It crashes into that layer. Thunks are a great way of implementing a stack-winding paradigm in C++. Let's look at an example that performs the same task of setting up DirectX as we saw earlier:

Listing 1.13.3 Visual C++ example using TempRet


#define TempRet \
    __asm{pop edx} \
    __asm{call edx}
#define NAKED void __declspec(naked)
#define JUMP __asm jmp
#define RET __asm ret

static NAKED createWindow(){
    // open the window
    TempRet
    // close it
    RET
}

static NAKED setCooperativeLevel(){
    // set to exclusive
    TempRet
    // restore
    RET
}

static NAKED changeDisplayMode(){
    // set 640x480 16 bpp
    TempRet
    // restore
    RET
}

static NAKED createSurfaces(){
    // create primary surface
    // get attached back
    TempRet
    // restore
    RET
}

NAKED SetUpDX7(){
    createWindow();
    setCooperativeLevel();
    changeDisplayMode();
    createSurfaces();
    JUMP run
}

Recursion
As a final example of the power of stack winding, we will explore a solution for a classic problem with recursive searching: how to roll back the recursion. In regular C we would simply return repeatedly, walking back through the stack until we reach the top. If our recursion is over 100 calls deep, however, this might take a little time. To fix this, here is a pair of utility functions called SafeEnter and SafeExit. Incidentally, the code works just as well from a C++ object as from a global function.

Listing 1.13.4 The SafeEnter and SafeExit functions that aid recursion
.586
.model flat
.code
public _SafeEnter,_SafeExit

; struct SAFE{
;     void* reg[8];
;     void* ret;
; }

; assembly for SafeEnter routine
_SafeEnter:
    pop  edx                    ; return address
    mov  eax,[esp]              ; safe
    mov  [eax].safe.ret,edx
    mov  [eax].safe.ebx,ebx
    mov  [eax].safe.ebp,ebp
    mov  [eax].safe.esp,esp
    mov  [eax].safe.esi,esi
    mov  [eax].safe.edi,edi
    pop  eax                    ; safe pointer
    pop  edx                    ; call function
    push eax                    ; safe pointer
    mov  ebp,eax
    call edx
    mov  eax,ebp
    jmp  sex

_SafeExit:
    pop  edx                    ; return
    pop  eax                    ; regs context
sex:
    mov  edi,[eax].safe.edi
    mov  esi,[eax].safe.esi
    mov  esp,[eax].safe.esp
    mov  ebp,[eax].safe.ebp
    mov  ebx,[eax].safe.ebx
    mov  edx,[eax].safe.ret
    mov  eax,[eax].safe.eax
    jmp  edx

end

SafeEnter works by saving off to a SAFE structure a copy of crucial CPU registers. It then calls our recursive function. As far as the function is concerned, no extra work is necessary. Now the cool part comes when we find the piece of data we're looking for. We simply call SafeExit() and pass it the register context we built earlier. We are instantly transported back to the parent function. Now, if the unthinkable happened and the search routine did not meet its search criteria, then the function can simply return in the normal way, all the way up the chain.

Listing 1.13.5 Recursive example using SafeEnter and SafeExit


static void search(SAFE&safe,void*v){
    if(<meets_requirement>)
        SafeExit(safe);
    // do stuff
    search(safe,v);
    return;
}

int main(){
    SAFE safe;
    SafeEnter( safe, search, <some_pointer>);
}
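The same "jump straight back to the top" behavior can be tried out without any assembly by using setjmp and longjmp from the C standard library. The sketch below is an analog of SafeEnter and SafeExit, not the gem's routines; the Node type, FindValue, and the global variables are assumptions made for the example.

```cpp
#include <csetjmp>

// Hypothetical tree node for the search.
struct Node {
    int   value;
    Node* left;
    Node* right;
};

static std::jmp_buf g_safe;          // plays the role of the SAFE structure
static Node*        g_found = nullptr;

// Recursive search. On a hit it jumps straight back to FindValue instead of
// returning through every level. No objects with destructors live in these
// frames, which is required for longjmp to be well defined in C++.
static void Search(Node* n, int wanted)
{
    if (!n)
        return;
    if (n->value == wanted) {
        g_found = n;
        std::longjmp(g_safe, 1);     // analogous to SafeExit(safe)
    }
    Search(n->left, wanted);
    Search(n->right, wanted);
}

Node* FindValue(Node* root, int wanted)
{
    g_found = nullptr;
    if (setjmp(g_safe) == 0)         // analogous to SafeEnter(safe, ...)
        Search(root, wanted);        // returns normally only on a miss
    return g_found;
}
```

As with the assembly version, a failed search simply returns in the normal way, all the way up the chain.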

1.14
Self-Modifying Code
Bryon Hapgood, Kodiak Interactive
[email protected]

Self-modifying code, also known as "RAM-code," is a fascinating technique that actually allows a program to alter its own code as it executes. It has been used in everything from genetic algorithms to neural networks with amazing results. In games it can be used as a powerful optimization technique. Recently I used this technique on a GameBoy Color title, Test Drive Cycles, to decompress artwork on the fly at 60 fps, decode 14 palettes of color information (instead of the standard eight), and enable multiple levels of parallax scrolling. In this gem, we will cover how to write self-modifying applications.

The Principles of RAM-Code


RAM-code is a simple idea, but one that can take an inordinate amount of time to get just right. It is written for the most part in hexadecimal and can be difficult to debug. Let's look at a very simple case. We want to load a pointer from a 16-bit variable stored somewhere in RAM.
get_hl:
    ld hl,ptr_var   ; Load HL register with the address ptr_var
    ld a,(hli)      ; Load A register with low byte and increment HL
    ld h,(hl)       ; Load H register with high byte of ptr_var
    ld l,a          ; Save low byte into L
    ret             ; Return

This example can be improved by writing it as:


get_hl:
    db $2a          ; ld hl,...
ptr_var:
    dw $0000        ; ...ptr_var
    ret

These two routines are logically no different from each other, but can you see the difference? The second example stores the variable that holds the address to be loaded into HL as an immediate value! In other words, instead of physically going out and loading an address, we just load HL. It's quicker to load an immediate value than to access main memory, and because there are fewer bytes to decode, the code runs much faster.

We can take that idea much further when it comes to preserving registers. Instead of pushing and popping everything, which can be expensive, we simply write the value ahead into the code. For example, instead of writing:
get_hl:
    ld hl,ptr_var
    ld a,(hli)
    ld l,(hl)
    ld h,a
    ld a,(hl)
    push af         ; Save A register
    ;
    ; do something with A
    ;
    pop af          ; Restore A register
    ret

this code can be optimized down to:


get_hl:
    db $2a          ; ld hl,...
ptr_var:
    dw $0000        ; ...ptr_var
    ld a,(hl)
    ld (var1),a
    ;
    ; do something with A
    ;
    db $2F          ; ld a,...
var1:
    db $00          ; ...saved register value
    ret

This is not a huge saving, but it illustrates the point.

A Fast Bit Blitter


In many games, it is often crucial to convert from one pixel format to another, such as from 16-bit (565) RGB to 24-bit RGB. Whether this is done in an offline tool or within the game itself, this one routine can handle it. We can define a structure (call it BITMAP) that contains information about an image. From this, our blitter can use RAM-code techniques to construct an execute buffer: a piece of code that has been allocated with malloc and filled with assembly instructions. The blitter works by taking a routine that knows how to read 16-bit (565) RGB pixels and convert them to 32-bit RGBA values, and a routine that knows how to write them in another format. We can paste these two functions together, once for images with odd widths, or multiple times in succession to effectively unroll our loop. The example shown next takes the former approach. So, let's define our bitmap structure and associated enumerated types.


enum Format{
    RGB_3x8=0,
    RGB_565=4,
    RGB_555=8,
    RGB_4x8=12,
    RGB_1x8=16
};

struct BITMAP{
    void *pixels;
    u32 w, h, depth;
    TRIPLET *pal;
    Format pxf;
    u32 stride;
    u32 size;
    BITMAP();
    BITMAP( int, int, int, Format, int n=1 );
    void draw( int, int, BITMAP&, int, int, int, int );
    operator bool(){
        return pixels!=NULL;
    }
};

Now, it's really important that we have the same structure defined on the assembly side of things.
BITMAP struct
    pixels  dd ?
    w       dd ?
    h       dd ?
    depth   dd ?
    pal     dd ?
    pxf     dd ?
    stride  dd ?
    size    dd ?
BITMAP ends

PF_BGR_3x8 = 00h
PF_BGR_565 = 04h
PF_BGR_555 = 08h
PF_BGR_4x8 = 0Ch
PF_BGR_1x8 = 10h

The next step is to define our execute buffer.

execute_buffer db 128 dup(?)

For this code to work in C++, we must use a mangled C++ name for the member function BITMAP::draw. After that comes some initialization code:

?draw@BITMAP@@QAEXHHAAU1@HHHH@Z:
    push ebp
    lea  ebp,[esp+8]            ; get arguments address
    push ebx
    push edi
    push esi
    mov  edi,ecx                ; dst bitmap
    mov  esi,[ebp+8]            ; src bitmap
    mov  eax,[esi].bitmap.pxf

The first thing we must decide is whether we need to do a conversion at all. Therefore, we test to see if the two pixel formats of the bitmap objects are the same. If so, we can further ask whether they are the same size. If that is the case, we can just do a fast string copy from one to the other. If not, but they're the same width, then we can still do the string copy. If the two have different widths, then we can do string copies line by line.

    mov  edx,[edi].bitmap.pxf
    cmp  eax,edx
    jne  dislike
    ;
    ; like copy
    ;
    mov  ecx,[esi].bitmap._size
    cmp  ecx,[edi].bitmap._size
    je   k3
    mov  ecx,[edi].image.stride
    mov  edx,[esi].image.stride
    cmp  edx,ecx
    jne  @f
    ;
    ; same w different h
    ;
    mov  edx,[edi].image.h
    mov  eax,[esi].image.h
    cmp  eax,edx
    jl   k2
    mov  eax,edx
k2: mul  ecx
    mov  ecx,eax
k3: mov  esi,[esi].image.lfb
    mov  edi,[edi].image.lfb
    shr  ecx,2
    rep  movsd
    jmp  ou
    ;
    ; find smallest h -> ebx
    ;
@@: mov  eax,[edi].image.h
    mov  ebx,[esi].image.h
    cmp  ebx,eax
    jl   @f
    mov  ebx,eax
    ; calc strides
@@: add  ebp,12
    mov  eax,[ebp].rectangle.w
    mul  [esi].image.depth      ; edx corrupts
    mov  edx,[esi].image.stride
    sub  ecx,eax
    sub  edx,eax
    ;
    ; calc offsets with intentional reg swap
    ;
    push eax
    push ecx
    push edx
    call calc_esdi
    pop  edx
    pop  eax
    pop  ecx
    ; edx=dest pad
    shr  ecx,2
    mov  ebp,ecx
@@: rep  movsd
    lea  edi,[edi+eax]
    lea  esi,[esi+edx]
    mov  ecx,ebp
    dec  ebx
    jne  @b
ou: pop  esi
    pop  edi
    pop  ebx
    pop  ebp
    ret  1ch

If the two bitmaps have completely different pixel formats, we have no choice but to convert every single pixel from one format to the other. The following code shows this in action. This routine could be improved further by unrolling the loop, which would be as simple as repeating the build step four or more times.

dislike:
    lea  eax,execute_buffer
    add  ebp,12
    push ou
    push eax
    push edi
    push esi
    push ebp
    mov  ebx,edi                ; destination image
    mov  edi,eax
    ;
    ; write "mov ebp,h"
    ;
    mov  al,0BDh
    stosb
    mov  eax,[ebp].rectangle.h
    stosd
    ;
    ; write "mov ecx,w"
    ;
    mov  al,0B9h
    stosb
    mov  eax,[ebp].rectangle.w
    stosd
    ;
    ; get read
    ;
    mov  edx,22
    mov  ebp,esi                ; source
    mov  eax,[ebp].image.pf
    mov  esi,rtbl_conv[eax]
    lodsd
    mov  ecx,eax
    add  edx,ecx
    rep  movsb
    ;
    ; put write
    ;
    mov  eax,[ebx].image.pf
    mov  esi,wtbl_conv[eax]
    lodsd
    mov  ecx,eax
    add  edx,ecx
    rep  movsb
    ;
    ; write tail
    ;
    mov  ecx,[esp]              ; start of exec_tail
    push edx
    sub  dl,19
    neg  dl
    shl  edx,16
    or   edx,08D007549h
    mov  eax,edx
    stosd
    mov  eax,[ecx].rectangle.w  ; args
    push eax
    mov  ecx,[ebp].image.stride ; source
    mul  [ebp].image.depth
    sub  ecx,eax
    jz   @f
    mov  al,0B6h
    stosb
    mov  eax,ecx
    stosd
    jmp  pq
    ;
    ; modify outer branch
    ;
@@: dec  edi
    mov  eax,[esp+4]
    sub  eax,6
    mov  [esp+4],eax
pq: pop  eax
    mul  [ebx].image.depth      ; dest
    mov  ecx,[ebx].image.stride
    sub  ecx,eax
    jz   @f
    mov  ax,0BF8Dh
    stosw
    mov  eax,ecx
    stosd
    pop  eax
    jmp  pr
@@: pop  eax
    sub  eax,6
pr: inc  al
    neg  al
    shl  eax,16
    or   eax,0C300754Dh
    stosd
    pop  ebp
    pop  esi
    pop  edi

dest

Another important step in this blitter is to correctly calculate the x and y offsets into the source and destination images. This routine does exactly that.
calc_esdi:
    ;
    ; Destination
    ;
    mov  eax,[ebp-12].point.x   ; get dx
    mul  [edi].image.depth      ; multiply by d
    mov  ecx,eax                ; store result
    mov  eax,[ebp-12].point.y   ; get dy
    mul  [edi].image.stride     ; multiply by stride
    mov  edi,[edi].image.lfb    ; get target pixels
    add  edi,ecx                ; add x
    add  edi,eax                ; add y
    ;
    ; Source
    ;
    mov  eax,[ebp].rectangle.x  ; get sx
    mul  [esi].image.depth      ; multiply by d
    mov  ecx,eax                ; store result
    mov  eax,[ebp].rectangle.y  ; get sy
    mul  [esi].image.stride     ; multiply by stride
    mov  edx,[esi].image.pal    ; palette info
    mov  esi,[esi].image.lfb    ; get target pixels
    add  esi,ecx                ; add x
    add  esi,eax                ; add y
    ret

98

Section 1 General Programming

For this whole RAM-code idea to work, we need some initialization that gets placed at the top of the RAM-code buffer. It simply loads the ECX register with the number of scan lines to copy.

exec_head   dd                              ; (size)
            db 0B9h,000h,000h,000h,000h     ; mov ecx,0
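The byte pattern above is the x86 encoding of "mov ecx, imm32": opcode 0B9h followed by a 32-bit immediate in little-endian order. As a rough illustration of how an execute buffer is filled and later patched, here is a small C++ sketch that only builds the bytes and never executes them. The helper names EmitU32, EmitMovEcx, and PatchImm32 are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value in little-endian order, as stosd would on x86.
void EmitU32(std::vector<std::uint8_t>& code, std::uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        code.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Emit "mov ecx, imm32": opcode 0B9h followed by the immediate.
void EmitMovEcx(std::vector<std::uint8_t>& code, std::uint32_t imm)
{
    code.push_back(0xB9);
    EmitU32(code, imm);
}

// Overwrite the immediate of a previously emitted mov.
// 'offset' is the position of the opcode byte inside the buffer.
void PatchImm32(std::vector<std::uint8_t>& code, std::size_t offset,
                std::uint32_t imm)
{
    for (int i = 0; i < 4; ++i)
        code[offset + 1 + i] = static_cast<std::uint8_t>(imm >> (8 * i));
}
```

Writing a placeholder of zero and patching it later is the same idea as the "mov ecx,0" template: the instruction is fixed, and only its operand changes from one blit to the next.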

The next few routines are the actual read and write routines (RC and WC). The first doubleword tells us how many bytes make up the code in each subroutine.

RC_BGR_1x8  dd 18                           ; (size)
            db 033h,0C0h                    ; xor eax,eax
            db 0ACh                         ; lodsb
            db 08Bh,0D8h                    ; mov ebx,eax
            db 003h,0C0h                    ; add eax,eax
            db 003h,0C3h                    ; add eax,ebx
            db 003h,0C2h                    ; add eax,edx
            db 08Bh,000h                    ; mov eax,[eax]
            db 025h,0FFh,0FFh,0FFh,000h     ; and eax,00FFFFFFh

RC_BGR_3x8  dd 7                            ; (size)
            db 0ADh                         ; lodsd
            db 025h,0FFh,0FFh,0FFh,000h     ; and eax,00FFFFFFh
            db 04Eh                         ; dec esi

RC_BGR_4x8  dd 1                            ; (size)
            db 0ADh                         ; lodsd

RC_BGR_565:
            dd 1                            ; (size)
            lodsw

RC_BGR_555:
            dd 1                            ; (size)
            lodsw

WC_BGR_3x8  dd 6                            ; (size)
            db 0AAh                         ; stosb
            db 0C1h,0E8h,008h               ; shr eax,8
            db 066h,0ABh                    ; stosw

WC_BGR_555  dd 28                           ; (size)
            db 033h,0DBh                    ; xor ebx,ebx
            db 0C0h,0E8h,003h               ; shr al,3
            db 0C0h,0ECh,003h               ; shr ah,3
            db 08Ah,0DCh                    ; mov bl,ah
            db 066h,0C1h,0E3h,005h          ; shl bx,5
            db 00Ah,0D8h                    ; or bl,al
            db 0C1h,0E8h,013h               ; shr eax,13h
            db 066h,0C1h,0E0h,00Ah          ; shl ax,0Ah
            db 066h,00Bh,0C3h               ; or ax,bx
            db 066h,0ABh                    ; stosw

WC_BGR_565  dd 28                           ; (size)
            db 033h,0DBh                    ; xor ebx,ebx
            db 0C0h,0E8h,003h               ; shr al,3
            db 0C0h,0ECh,002h               ; shr ah,2
            db 08Ah,0DCh                    ; mov bl,ah
            db 066h,0C1h,0E3h,005h          ; shl bx,5
            db 00Ah,0D8h                    ; or bl,al
            db 0C1h,0E8h,013h               ; shr eax,13h
            db 066h,0C1h,0E0h,00Bh          ; shl ax,0Bh
            db 066h,00Bh,0C3h               ; or ax,bx
            db 066h,0ABh                    ; stosw

WC_BGR_4x8  dd 1                            ; (size)
            db 0ABh                         ; stosd
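To make the two 28-byte write routines easier to follow, here is the same arithmetic expressed in C++. This is a reading aid rather than part of the blitter: the function names Pack555 and Pack565 are hypothetical, and the sketch masks off the top byte of the input, which the machine code does not do explicitly.

```cpp
#include <cstdint>

// Input layout assumed here: bytes 0, 1 and 2 of the 32-bit value are the
// three colour channels in the order the routines find them in EAX;
// byte 3 is ignored.

// Mirrors WC_BGR_555: 5 bits per channel.
std::uint16_t Pack555(std::uint32_t px)
{
    std::uint32_t c0 = (px & 0xFF) >> 3;            // shr al,3
    std::uint32_t c1 = ((px >> 8) & 0xFF) >> 3;     // shr ah,3
    std::uint32_t c2 = ((px >> 16) & 0xFF) >> 3;    // shr eax,13h
    return static_cast<std::uint16_t>(c0 | (c1 << 5) | (c2 << 10));
}

// Mirrors WC_BGR_565: the middle channel keeps 6 bits.
std::uint16_t Pack565(std::uint32_t px)
{
    std::uint32_t c0 = (px & 0xFF) >> 3;            // shr al,3
    std::uint32_t c1 = ((px >> 8) & 0xFF) >> 2;     // shr ah,2
    std::uint32_t c2 = ((px >> 16) & 0xFF) >> 3;    // shr eax,13h
    return static_cast<std::uint16_t>(c0 | (c1 << 5) | (c2 << 11));
}
```

The only differences between the two formats are the shift applied to the middle channel and the position of the top channel, which is exactly where the two byte tables differ.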

Finally, we have a table that tells us which routine to use for every pixel format in BITMAP::pf.

rtbl_conv   dd RC_BGR_3x8
            dd RC_BGR_565
            dd RC_BGR_555
            dd RC_BGR_4x8
            dd RC_BGR_1x8

wtbl_conv   dd WC_BGR_3x8
            dd WC_BGR_565
            dd WC_BGR_555
            dd WC_BGR_4x8

1.15
File Management Using Resource Files
Bruno Sousa, Fireworks Interactive
[email protected]

As games increase in size (I think the grand prize goes to Phantasmagoria with seven CDs), there is a need for organization of the game data. Having 10 files in the same directory as the executable is acceptable, but having 10,000 is not. Moreover, there is the directory structure, sometimes going five or more levels deep, which is a pain to work with. Because our games will hardly resemble Windows Explorer, we need to find a clean, fast way to store and organize our data. This is where resource files come into play. Resource files give us the power to encapsulate files and directories into a single file, with a useful organization. They can also take advantage of compression, encryption, and any other features we might need.

What Is a Resource File?


We already use resource files all the time in our daily work; examples of these are WinZip, the Windows installer, and backup programs. A resource file is nothing more than a representation of data, usually from multiple files, but stored in just one file (see Listing 1.15.1). Using directories, we can make a resource file work just like a hard drive's file system does.

Listing 1.15.1 Resource file structure.


Signature = "SEALRFGNU" + '\0'
Version = 1.0
Number of Files = 58
Offset of First File = 19
[File 1]
[File 2]
[File 3]
[File .]
[File .]
[File .]
[File Number Of Files - 1]
[File Number Of Files]


Each lump (we will start calling files "lumps" from now on) in the resource file has its own structure, followed by all of the data (see Listing 1.15.2).

Listing 1.15.2 File lump structure.


File Size = 14,340
Filename = "/bmp/Bob.bmp" + '\0'
Flags = COMPRESSED
FlagsInfo = 0xF34A400B
[Byte 1]
[Byte 2]
[Byte 3]
[Byte .]
[Byte .]
[Byte .]
[Byte File Size - 1]
[Byte File Size]

Before we do anything else, we'll need to name our resource system. We can then use the name to give each component a special naming scheme, one that will differentiate it from the other parts of the game. Let's call this system the "Seal Resource File System," abbreviated to SRFS, and use "sl" for class prefixes.

First, we need a resource file header. By looking at Listing 1.15.1, it's easy to see that we are keeping our system simple. However, that doesn't mean it isn't powerful; it means that it was designed to accommodate the most-needed features and still retain a fairly understandable syntax and structure. Our resource file header gives us all the relevant information about the system.

Multiple file types are used in games, and for each type, there is usually a file header that contains something unique to differentiate it from other file types. SRFS is no different, so the first data in its header is the file signature. This is usually a 5- to 10-character string, and is required so that we can identify the file as a valid Seal resource file. The version information is pretty straightforward: it is used to keep track of the file's version, which is required for a very simple reason. If we decide to upgrade our system by adding new features or sorting the lumps differently, we need a way to verify whether the file being used supports these new features, and if so, use the latest code. Otherwise, we should go back to the older code; backward compatibility across versions is an important design issue and should not be forgotten.

The next field in the header is for special flags. For our first version of the file system, we won't use this, so it must always be NULL (0). Possible uses for this flag are described in the For the Future section. Following this is the number of lumps contained in the resource file, and the offset to the first lump. This offset is required to get back to the beginning of the resource file if we happen to get lost, and can also be used to support future versions of this system. Extra information could be added after this header for later versions, and the offset will point to the first lump.


We now move to our lump header, which holds the information we need to start retrieving our data. We start with the lump size in bytes, followed by name and directory, stored as a fixed-length, NULL-terminated string. Following this is the flags member, which specifies the type of algorithm(s) used on the lump, such as encryption or compression. After that is information about the algorithm, which can contain a checksum for encryption or dictionary information for compression (the exact details depend on the algorithms). Finally, after all of this comes the lump information stored in a binary form. Our system has only two modules: a resource file module and a lump module. To be able to use a lump, we need to load it from the resource file and possibly decrypt or decompress it into a block of memory, which can be accessed normally. Some systems prefer to encapsulate all functionality into the resource file module, and even allow direct access to lump data from within this module. This approach certainly has advantages, but the biggest disadvantage is probably that we need to have the whole resource in memory at once, unless we use only raw data or complicated algorithms to dynamically uncompress or decrypt our lump data to memory. This is a difficult process and beyond the scope of this gem. We need functions to open the resource file, read the header, open individual lumps, read information from lumps, and get data from lumps. These are covered in the Implementation section.
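As an illustration of how the resource file header might be checked when a file is opened, here is a small C++ sketch that parses a header from a memory buffer. The gem does not give field widths or byte order, so the layout used here (the 10-byte signature followed by four little-endian 32-bit fields) is purely an assumption, as are the names ResourceHeader, ReadU32, and ParseHeader.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory form of the resource file header.
struct ResourceHeader {
    std::uint32_t version;
    std::uint32_t flags;              // unused in version 1, must be 0
    std::uint32_t numberOfLumps;
    std::uint32_t offsetOfFirstLump;
};

static const char kSignature[] = "SEALRFGNU";   // includes the trailing '\0'

// Read one little-endian 32-bit value (assumed byte order).
static std::uint32_t ReadU32(const std::uint8_t* p)
{
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Returns true only if the signature matches, the flags field is zero,
// and the first lump does not overlap the header itself.
bool ParseHeader(const std::vector<std::uint8_t>& file, ResourceHeader& out)
{
    const std::size_t sigLen = sizeof(kSignature);      // 10 bytes
    if (file.size() < sigLen + 16)
        return false;
    if (std::memcmp(file.data(), kSignature, sigLen) != 0)
        return false;

    const std::uint8_t* p = file.data() + sigLen;
    out.version           = ReadU32(p);
    out.flags             = ReadU32(p + 4);
    out.numberOfLumps     = ReadU32(p + 8);
    out.offsetOfFirstLump = ReadU32(p + 12);
    return out.flags == 0 && out.offsetOfFirstLump >= sigLen + 16;
}
```

Rejecting a file early, before any lump is touched, keeps the rest of the loader free of defensive checks.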

Implementation
The sample code included on the CD is written in C++, but for the text, we will use pseudocode so it will be easy to implement in any language.

The slCLump Module

Our lump module is similar to file streams in C++ or other language implementations of files in that we can write to it. Unfortunately, updating the resource file with a lump is very troublesome due to the nature of C++ streams. We can't add data to the middle of the stream (we can only replace it), and we can't modify the parent resource file.

DWORD dwLumpSize;
STRING szLumpName;
DWORD dwLumpPosition;
BYTE [dwLumpSize] abData;

The variable dwLumpSize is a double word (32 bits) that specifies the size of the lump, szLumpName is a string describing the lump's name, dwLumpPosition keeps the lump's pointer position, and abData is an array of bytes with the lump information. Here are the slCLump module functions:

DWORD GetLumpSize (void);
STRING GetLumpName (void);
DWORD Read (BYTE [dwReadSize] abBuffer, DWORD dwReadSize);
DWORD Write (BYTE [dwWriteSize] abBuffer, DWORD dwWriteSize);
DWORD Seek (DWORD dwSeekPosition, DWORD dwSeekType);
BOOLEAN IsValid (void);

GetLumpSizeO retrieves the lump's size, and GetLumpName() retrieves the lump's name. Read() reads dwReadSize bytes into sbBuffer, and Write () does the exact opposite, writing dwWriteSize bytes to sbBuffer. S e e k ( ) moves the lump's pointer by a given number from a seek position, and I sValid () verifies if the lump is valid. The sICResourceFile Module This module has all the functionality needed to load any lump inside the resource. The module members are nearly the same as those in the resource file header.
DWORD dwVersion;
DWORD dwFlags;
DWORD dwNumberOfLumps;
DWORD dwOffset;
STRING szCurrentDirectory;
FILE fFile;

The use of these members has already been described, so here is a brief definition of each. dwVersion is a double word that specifies the file version, dwFlags is a double word containing any special flags for the lump, dwNumberOfLumps is the number of lumps in the resource, dwOffset gives us the position in bytes where the first lump is located, szCurrentDirectory is the directory we are in, and fFile is the actual C++ stream. Now for the real meat of our system, the sICResourceFile functions: those that we use to access each lump individually.
void OpenLump (STRING szLumpName, sICLump inOutLump);
void IsLumpValid (STRING szLumpName);
void SetCurrentDirectory (STRING szDirectory);
STRING GetCurrentDirectory (void);

Each of these functions is very simple. IsLumpValid () checks to see if a file with a given szLumpName exists in the resource. SetCurrentDirectory () sets the resource file directory to szDirectory. This directory name is prepended to each lump's name when accessing individual lumps within the resource file. GetCurrentDirectory() returns the current directory. Now for our Open function. This function opens a lump within the resource file, and the logic behind the algorithm is described in pseudocode.
Check flags of Lump
if Compressed and Encrypted
    OpenLumpCompressedEncrypted (szLumpName, inOutLump)
else if Compressed
    OpenLumpCompressed (szLumpName, inOutLump)
else if Encrypted
    OpenLumpEncrypted (szLumpName, inOutLump)
else
    OpenLumpRaw (szLumpName, inOutLump)
end if

Depending on the lump type, the appropriate function to open the lump is called, which keeps the design clean and the code simple. The source of each function is included on the CD.
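The dispatch described above can be sketched in C++ as follows. This is only an illustration, not the code from the CD: the flag bit values, the LumpOpener enumeration, and the SelectOpener and OpenerName helpers are all hypothetical names introduced here. The sketch tests the combined case first, so that a lump with both flags set is not routed to the compressed-only opener.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical flag bits; the gem does not give the actual values.
enum LumpFlags : std::uint32_t {
    kLumpRaw        = 0,
    kLumpCompressed = 1u << 0,
    kLumpEncrypted  = 1u << 1
};

// Which opener a given flags value should be routed to.
enum class LumpOpener { Raw, Compressed, Encrypted, CompressedEncrypted };

// Selects the opener for a lump. The combined case is checked first so a
// lump with both bits set is not mistaken for a compressed-only lump.
LumpOpener SelectOpener(std::uint32_t flags)
{
    const bool compressed = (flags & kLumpCompressed) != 0;
    const bool encrypted  = (flags & kLumpEncrypted) != 0;
    if (compressed && encrypted) return LumpOpener::CompressedEncrypted;
    if (compressed)              return LumpOpener::Compressed;
    if (encrypted)               return LumpOpener::Encrypted;
    return LumpOpener::Raw;
}

// Name of the routine from the pseudocode that would be called.
std::string OpenerName(LumpOpener opener)
{
    switch (opener) {
    case LumpOpener::Compressed:          return "OpenLumpCompressed";
    case LumpOpener::Encrypted:           return "OpenLumpEncrypted";
    case LumpOpener::CompressedEncrypted: return "OpenLumpCompressedEncrypted";
    case LumpOpener::Raw:                 return "OpenLumpRaw";
    }
    assert(false && "unknown opener");
    return "";
}
```

A real OpenLump would call the selected routine instead of returning its name; keeping the selection in its own function makes the flag logic easy to test in isolation.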

Last Words about the Implementation


Some support functions that are used to open the file or to keep track of information that can't be called directly are not represented in the preceding text. It is advisable to check the source code on the CD, which is well commented and easy to follow. The algorithms for compression and encryption are simple RLE compression and bit-wise encryption, the actual implementations of which are beyond the scope of this gem and must be researched separately. Information about useful public domain algorithms is at [Wotsit00], [Wheeler00], and [Gillies98].

Conclusion
This system can be easily upgraded or adapted to any project. Some possibilities include supporting date and time validation, copy protection algorithms, checksums, a data pool, and better compression and encryption algorithms. There is no limit.

References
[Hargrove98] Hargrove, Chris, "Code on the Cob 6," available online at www.loonygames.com/content/1.11/cote/, November 2-6, 1998.
[Towner00] Towner, Jesse, "Resource Files Explained," available online at www.gamedev.net/reference/programming/features/resfiles/, January 11, 2000.
[Wheeler00] Wheeler, David J, et al, "TEA, The Tiny Encryption Algorithm," available online at www.cl.cam.ac.uk/ftp/users/djw3/tea.ps.
[Wotsit00] Wotsit.org, "The Programmer's File Format Collection: Archive Files," available online at www.wotsit.org, 1996-2000.
[Gillies98] Gillies, David A. G., "The Tiny Encryption Algorithm," available online at http://vader.brad.ac.uk/tea/tea.shtml, 1995-1998.

1.16
Game Input Recording and Playback
Bruce Dawson, Humongous Entertainment
[email protected]

The eighteenth-century mathematician and physicist Marquis Laplace postulated that if there was an intelligence with knowledge of the position, direction, and velocity of every particle of the universe, this intelligence would be able to predict by means of a single formula every detail of the total future as well as of the total past [Reese80]. This is determinism. Chaos theory, Heisenberg's uncertainty principle, and genuine randomness in quantum physics have combined to prove determinism wrong. However, in the simplified universe of a game, Laplace's determinism actually works. If you carefully record everything that can affect the direction of your game universe, you can replay your record and recreate what happened.

What Exactly Is Input Recording Useful For?


Game input recording is useful for more things than many people realize: reproducing rare bugs, replaying interesting games, measuring optimizations, or creating game movies.

Reproducing Bugs
Computer programs are deterministic and completely predictable, yet we frequently hear about people encountering bugs that are difficult to reproduce, and therefore difficult to fix. If computers are deterministic, how can bugs be difficult to reproduce? Occasionally, the culprit is the hardware or OS. The timing of thread switching and the hard drive is not completely consistent, so race conditions in your code can lead to rare crashes. However, the rare crashes are most frequently caused by a particular combination of user input that happens to be very rare. In that case, the bug is at least theoretically reproducible, if only we can reproduce the exact input sequence again. Videotaping of testing helps track some of these bugs, but it doesn't help at all if the timing is critical. Why don't we put that computer predictability to work, by having the computer program record all input and play it back on demand?

The crucial step here is that if we are to use input recording to track bugs, we have to make sure that the input is recorded even when the game crashes, and especially when the game crashes! On Win32, this is typically quite easy. By setting up a structured exception handler [Dawson99], we can arrange for our input buffer to be saved whenever the game crashes. If we add an option to our game engine to "fast-forward" through game input (only rendering a fraction of the frames), we can get to the crash more quickly. If we also add an option to render frames at the exact same points in the game update loop, we can easily reproduce what are likely to be the relevant portions of the crash scenario. Reproducing bugs is the one time when you will want to record and play back all of the user input, including the interactions with menu screens. Menu code is not immune to tricky bugs.

Replaying Interesting Games
The most common use of game input recording is for players to record interesting games. These recordings are used to demonstrate how to play the game, to make tutorials, to test the performance of new computer hardware, or to share games. The most important thing about recording games for users to play back later is that the recording must always be enabled. It is unrealistic to expect users to decide at the beginning of the game whether they want to record the game for posterity; they should be asked at the end whether they want to permanently store the recorded game.

Measuring Optimizations
The most important thing to do when optimizing is to measure the actual performance, both before and after. Failing to do this leads to a surprisingly frequent tendency to check in "optimizations" that actually slow the code. Measuring the performance of a game is tricky because it varies so much. Polygon count, texture set, overdraw, search path complexity, and the number of objects in the scene all affect the frame rate.
Timing any single frame is meaningless, and a quick run-through is hopelessly unscientific. Game input playback is a great solution. If you run the same playback multiple times, recording detailed information about game performance, you can chart your progress after each change to see how you're doing and where you still need work. Recording the average and worst-case frame rate and the consistency of the frame rate becomes easier and much more meaningful. Testing optimizations with game input playback doesn't always work because your changes might affect the behavior; your wonderful new frame rate might just mean the player has walked into a closet. Therefore, when using game input playback for optimization testing, it is crucial that you record critical game state and check for changes on playback.
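The three measurements mentioned above (average, worst case, and consistency) can be gathered from a playback run with very little code. The following is a minimal sketch under stated assumptions: frame times are collected in milliseconds into a vector, and consistency is expressed as the standard deviation. FrameStats and SummarizeFrameTimes are hypothetical names, not part of the CD sample code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Summary of one playback run (all values in milliseconds).
struct FrameStats {
    double averageMs;  // mean frame time
    double worstMs;    // slowest single frame
    double stdDevMs;   // spread of frame times; lower means more consistent
};

// Reduces the per-frame timings of a playback run to three numbers that
// can be compared before and after an optimization.
FrameStats SummarizeFrameTimes(const std::vector<double>& frameMs)
{
    assert(!frameMs.empty());
    double sum = 0.0;
    double worst = frameMs[0];
    for (double ms : frameMs) {
        sum += ms;
        worst = std::max(worst, ms);
    }
    const double mean = sum / frameMs.size();

    double variance = 0.0;
    for (double ms : frameMs)
        variance += (ms - mean) * (ms - mean);
    variance /= frameMs.size();

    return { mean, worst, std::sqrt(variance) };
}
```

Because the same recorded input drives every run, differences between two summaries can be attributed to the code change rather than to what the player happened to do.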

Creating Game Movies

To create a demo reel, you can hook a VCR up to a video-capable graphics card and play the game; however, the results will not be pretty. The VCR, the video encoder, and the variable frame rate of the game will lead to a blurry, jerky mess. With game input recording, it's trivial to record an interesting game, and then play it back. With some trivial modifications to the engine you will be able to tell the game engine when you are at an interesting part of the playback, at which point you can switch from real-time playback to movie record playback. In this mode, the engine can render precisely 60 frames for each second of game play, and record each one to disk. The frame rate may drop to an abysmal two frames per second, but it doesn't matter because the canned inputs will play back perfectly.

Implementing Multiplayer
A number of games, including X-Wing vs. TIE Fighter and Age of Empires, have used input recording and playback for their networking model [Lincroft99]. Instead of transmitting player status information, they just transmit player input. This works particularly well for strategy games with thousands of units.

What Does It Take?


Game input recording is simple in theory, and can be simple in practice as well. However, there are a few subtleties that can cause problems if you're not careful.

Making Your Game Predictable
For game input recording and playback to work, your game must be predictable. In other words, your game must not be affected by anything unpredictable or unknowable. For example, if your game can be affected by the exact timing of task switching, then your game is unpredictable. Many games use variably interleaved update and render loops. Input is recorded at a set frequency. A frame is rendered and then the game update loop runs as many times as necessary to process the accumulated set of inputs. This model implies that the number of times that the game update loop is run for each frame rendered is unpredictable; however, this needn't make the game itself unpredictable. If you are tracking down a bug in the renderer, then you may need to know the exact details of how the render loop and update loop were interleaved, but the rest of the time it should be irrelevant. It is worthwhile to record how many updates happened for each frame, but this information can be ignored on playback unless you are tracking a renderer bug. However, if the render function does anything to change the state of the game, then the variably interleaved update loop and render function do make the game unpredictable, and input recording will not work. One example of this is a render function that uses the same random number generator as the update loop. Another example can be found in Total Annihilation. In this game, the "fog of war" was only updated when the scene was rendered. This was a reasonable optimization because it reduced the frequency of this expensive operation. While it ensured that the user only ever saw accurate fog, it made the game's behavior unpredictable. The unit AI used the same fog of war as the renderer; the timing of the render function calls would subtly affect the course of the game. Another example of something that can make a game unpredictable is uninitialized local variables or functions that don't always return results. Either way, your game's behavior will depend on whatever happened to be on the stack. These are bugs in your code, so you already have a good reason to track them down. One tricky problem that can lead to unpredictability is sound playback. This can cause problems because the sound hardware handles sounds asynchronously. Tiny variances in the sound hardware can make a sound effect occasionally end a bit later. Even if the variation is tiny, if it happens to fall on the cusp between two frames, then it can affect your game's behavior if it is waiting for the sound to end. For many games, this is not a problem because there is no synchronization of the game to the end of these sounds. If you do want this synchronization, then there is a fairly effective solution: approximation. When you start your sound effect, calculate how long the sample will play (number of samples divided by frequency). Then, instead of waiting for the sound to end, wait until the specified amount of time has elapsed. The results will be virtually identical and they will be perfectly consistent.

Initial State
You also need to make sure that your game starts in a known state, whether starting a new game or loading a saved one. That usually happens automatically. However, each time you recompile or change your data you are slightly changing the initial state.
Luckily, many changes to code and data don't affect the way the game will unfold. For instance, if you change the size of a texture, then the frame rate may change, but the behavior should not, as long as the game is predictable. If changing the size of that texture causes all other memory blocks to be allocated at different locations, then this should also have no effect, as long as your code doesn't have any memory overwrite bugs. An example of a code or data change that could affect how your game behaves would be changing the initial position of a creature or wall, or slightly adjusting the probability of a certain event. Small changes might never make a difference, but they destroy the guarantee of predictability. Floating-point calculations are one area where your results may unexpectedly vary. When you compile an optimized build, the compiler may generate code that gives slightly different results from the unoptimized build, and occasionally, these differences will matter. You can use the "Improve Float Consistency" optimizer setting in Visual C++ to minimize these problems, but floating-point variations are an unavoidable problem that you just have to watch for.
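The sound-length approximation described earlier in this section (number of samples divided by frequency) can be expressed in game updates rather than wall-clock time, which keeps it deterministic. The sketch below assumes a fixed update rate and rounds up to a whole update; SoundLengthInUpdates and SoundFinished are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Number of fixed-rate game updates a sound lasts. The playing time is
// sampleCount / sampleRateHz seconds, rounded up to whole updates using
// integer arithmetic only, so the result is identical on every run.
std::uint32_t SoundLengthInUpdates(std::uint64_t sampleCount,
                                   std::uint32_t sampleRateHz,
                                   std::uint32_t updatesPerSecond)
{
    assert(sampleRateHz > 0);
    const std::uint64_t scaled = sampleCount * updatesPerSecond;
    return static_cast<std::uint32_t>((scaled + sampleRateHz - 1) / sampleRateHz);
}

// True once a sound started at startUpdate should be treated as finished,
// regardless of when the sound hardware actually stops playing it.
bool SoundFinished(std::uint32_t startUpdate,
                   std::uint32_t lengthInUpdates,
                   std::uint32_t currentUpdate)
{
    return currentUpdate - startUpdate >= lengthInUpdates;
}
```

Game logic that needs to wait for a sound would poll SoundFinished with the current update count instead of asking the sound system, so recording and playback see exactly the same answer.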

Random Numbers

Random numbers can be used in a deterministic game, but there are a few caveats. The reason random numbers can be used is that rand() isn't really random. rand() is implemented using a simple algorithm, typically a linear congruential method, that passes many of the tests for random numbers while being completely reproducible. This is called a pseudo-random number generator. As long as you initialize rand() with a consistent seed, you will get consistent results. If having the randomness in your game different each time is important, then choose a seed for srand() based on the time, but record the seed so that you can reuse it if you need to reproduce the game. One problem with rand() is that it produces a single stream of random numbers. If your rendering code and your game update code are both using rand(), and if the number of frames rendered per game update varies, then the state of the random number generator will quickly become indeterminate. Therefore, it is important that your game update loop and your renderer get their random numbers from different locations. Another problem with rand() is that its behavior isn't portable. That is, the behavior is not guaranteed to be identical on all platforms, and it is unlikely that it will be. The third problem with rand() comes if you save a game and continue playing, and then want to reload the saved game and replay the future inputs. To make this work predictably, you have to put the random number generator back to the state it was in when you saved the game. The trouble is, there's no way to do this. The C and C++ standards say nothing about the relationship between the numbers coming out of rand() and the number you need to send to srand() to put it back to that state. Visual C++, for instance, maintains a 32-bit random number internally, but only returns 15 of those bits through rand(), making it impossible to reseed. These three problems lead to an inescapable conclusion: don't use rand().
Instead, create random number objects that are portable and restartable. You can have one for your render loop, and one for your game update loop. When implementing your random number objects, please don't invent your own random number algorithm. Random number generators are very subtle and you are unlikely to invent a good one on your own. Look at your C runtime source code, the sample code on the CD, Web resources [Coddington], or read Knuth [Knuth81].

Inputs
Once you have restored your game's initial state, you need to make sure that you can record and play back all of the input that will affect your game. If your game update loop is calling OS functions directly to get user input (such as calling the Win32 function GetKeyState(VK_SHIFT) to find out when the Shift key is down), then it will be very hard to do this. Instead, all input needs to go through an input system. This system can record the state of all of the input devices at the beginning of each frame, and hand out this information as requested by the game update loop. The input system can easily record this information to disk, or read it back from disk, without the rest of the game knowing. The input system can read data from DirectInput, a saved game, the network, or a WindowProc, without the update loop knowing the difference. As a nice bonus, isolating the game input in one place makes your game code cleaner and more portable. Programmers have a habit of breaking all rules that are not explicitly enforced, so you need to prevent them from calling OS input functions directly. You can use the following technique to prevent programmers from accidentally using "off-limits" functions.
#define GetKeyState Please do not use this function
#define GetAsyncKeyState Please do not use this function either
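The input system described above can be reduced to a small class that either records the live device state each frame or replays a stored log. This is a minimal sketch, not the input system from the CD: InputState, InputSystem, and their members are hypothetical, and reading the real devices or writing the log to disk is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One frame's worth of input, sampled once at the start of the frame.
struct InputState {
    std::uint32_t buttons = 0;  // bit mask of keys and buttons held down
    std::int16_t  mouseX  = 0;
    std::int16_t  mouseY  = 0;
};

class InputSystem {
public:
    enum class Mode { Record, Playback };

    explicit InputSystem(Mode mode) : mode_(mode) {}

    // Replaces the log, for example with data read back from disk.
    void LoadLog(const std::vector<InputState>& log) { log_ = log; cursor_ = 0; }
    const std::vector<InputState>& Log() const { return log_; }

    // Called once per game update. In Record mode the live state is stored
    // and handed out; in Playback mode the live state is ignored and the
    // next logged state is used. Returns false when the log has run out.
    bool BeginFrame(const InputState& live)
    {
        if (mode_ == Mode::Record) {
            log_.push_back(live);
            current_ = live;
            return true;
        }
        if (cursor_ >= log_.size()) return false;
        current_ = log_[cursor_++];
        return true;
    }

    // The only way game code is meant to query input.
    bool IsDown(unsigned bit) const
    {
        assert(bit < 32);
        return ((current_.buttons >> bit) & 1u) != 0;
    }
    const InputState& Current() const { return current_; }

private:
    Mode mode_;
    std::vector<InputState> log_;
    std::size_t cursor_ = 0;
    InputState current_;
};
```

Because the update loop only ever calls IsDown() and Current(), it cannot tell whether the state came from the hardware or from a recording, which is the property the text asks for.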

Another important input to a multiplayer game is the network. If you want to be able to replay your game, then you need to record the incoming network data together with the user's input stream. This will allow you to replay the game, even without a network connection. The network data stream is the one type of data that can actually get quite large; a game running on a 56K modem could easily receive many megabytes of network data per hour. While this large data stream does make the recording more unwieldy, it is not big enough to be really problematic. The benefits of recording this stream are enormous, and the costs are quite small. The final "input" that a game might use is time. You may want certain events to happen at a specific time, and it is important that these times are measured in game time, not in real time. Whenever your game needs to know the time (except for profiling purposes), it should ask the game engine for the current game time. As with the other input functions, it is a good idea to use the preprocessor to make sure that nobody accidentally writes code that calls timeGetTime() or other OS time functions. It is a good idea to record inputs throughout the game. That lets you use input playback to track down bugs anywhere in the game, even in the pre-game menus. However, for many purposes you will want to store the record of the input during the game separately, so that you can play it back separately.

Testing Your Input Recording


Game input recording should work on any well-written game. Even if your game is a multiplayer game, if you record every piece of input that you receive on your machine, then you should be able to reproduce the same game. However, if your game playbacks are failing to give consistent results, it can be difficult to determine why. A useful option in tracking down these problems is recording part of the game state along with the input, perhaps the health and location of all of the game entities. Then, during playback, you can check for changes and detect differences before they become visible.
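One inexpensive way to record part of the game state is to store a hash of it for each update and compare hashes during playback. The sketch below is an assumption about how this might be done, not the method used on the CD: EntityState is a hypothetical stand-in for real game entities, and the hash is 32-bit FNV-1a applied field by field so that struct padding never influences the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-entity state worth checking on playback.
struct EntityState {
    std::int32_t health;
    std::int32_t x;
    std::int32_t y;
};

// 32-bit FNV-1a hash over the fields of every entity.
std::uint32_t HashGameState(const std::vector<EntityState>& entities)
{
    std::uint32_t hash = 2166136261u;              // FNV offset basis
    auto mix = [&hash](std::int32_t value) {
        const std::uint32_t v = static_cast<std::uint32_t>(value);
        for (int i = 0; i < 4; ++i) {
            hash ^= (v >> (8 * i)) & 0xFFu;        // one byte at a time
            hash *= 16777619u;                     // FNV prime
        }
    };
    for (const EntityState& e : entities) {
        mix(e.health);
        mix(e.x);
        mix(e.y);
    }
    return hash;
}

// Returns the index of the first update whose hash differs between the
// recording and the playback, or -1 if the two runs match.
long FirstDivergence(const std::vector<std::uint32_t>& recorded,
                     const std::vector<std::uint32_t>& replayed)
{
    assert(recorded.size() == replayed.size());
    for (std::size_t i = 0; i < recorded.size(); ++i)
        if (recorded[i] != replayed[i]) return static_cast<long>(i);
    return -1;
}
```

Knowing the first update at which the hashes differ narrows the search for the source of unpredictability to a single step of the update loop.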


Conclusion
Game input recording and playback is a valuable part of a game engine with many benefits. If it is planned from the beginning, then it is easy to add, and leads to a better-engineered and more flexible game engine. Here are some rules to follow:
Route all game input, including keyboard, mouse, joystick, network, and time, through a single input system, to ensure consistency and to allow recording and saving of all input.
This input should always be recorded. It should be stored permanently in case the game crashes or the user requests it at the end of the game.
Watch for floating-point optimizations or bugs in your code that can occasionally lead to behavior that is different or unpredictable in optimized builds.
The rand() function should be avoided; use random number objects instead.
Never change the game's state in rendering functions.
Store some of your game state along with the input so you can automatically detect inconsistencies. This can help detect race conditions, unintended code changes, or bugs.
The sample code on the CD includes an input system and a random number class.
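A random number object of the kind recommended in this gem needs three properties: separate instances, identical output on every platform, and a state that can be saved and restored. The following is a minimal sketch and not the random number class from the CD. RandomStream is a hypothetical name, and the generator is a plain 32-bit linear congruential method using the widely published multiplier 1664525 and increment 1013904223 rather than an invented algorithm; it is adequate for illustration, not a statement about what quality a shipping game needs.

```cpp
#include <cassert>
#include <cstdint>

// A small, restartable pseudo-random stream. Each subsystem (update loop,
// renderer) would own its own instance.
class RandomStream {
public:
    explicit RandomStream(std::uint32_t seed) : state_(seed) {}

    // Advances the generator. Unsigned 32-bit arithmetic wraps the same
    // way everywhere, so the sequence is portable.
    std::uint32_t Next()
    {
        state_ = state_ * 1664525u + 1013904223u;
        return state_;
    }

    // Value in [0, range), taken from the high bits, which are the
    // better-distributed bits of a linear congruential generator.
    std::uint32_t NextBelow(std::uint32_t range)
    {
        assert(range > 0);
        const std::uint64_t wide = static_cast<std::uint64_t>(Next()) * range;
        return static_cast<std::uint32_t>(wide >> 32);
    }

    // The complete state is one word, so saving a game can store it and
    // loading can put the generator back exactly where it was.
    std::uint32_t GetState() const { return state_; }
    void SetState(std::uint32_t state) { state_ = state; }

private:
    std::uint32_t state_;
};
```

Unlike rand(), the whole state is exposed through GetState() and SetState(), which is what makes reloading a saved game and replaying later inputs possible.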

References
[Reese80] Reese, W.L., Dictionary of Philosophy and Religion. Humanities Press, Inc. 1980. p. 127.
[Knuth81] Knuth, Donald, The Art of Computer Programming, Second Edition, Volume 2, Seminumerical Algorithms.
[Coddington] Coddington, Paul, "Random Number Generators," available online at www.npac.syr.edu/users/paulc/lectures/montecarlo/node98.html.
[Dawson99] Dawson, Bruce, "Structured Exception Handling," Game Developer magazine (Jan 1999): pp. 52-54.
[Lincroft99] Lincroft, Peter, "The Internet Sucks: What I Learned Coding X-Wing vs. TIE Fighter," 1999 Game Developers Conference Proceedings, Miller Freeman 621-630.

1.17
A Flexible Text Parsing System
James Boer, Lithtech, Inc.
[email protected]

Nearly every modern game requires some sort of text parser. This gem, along with the sample code on the CD, demonstrates a powerful but easy-to-use text parsing system designed to handle any type of file format. Text files have a number of advantages when representing data:
They are easy to read and edit using any standard text editor. Binary data usually requires a custom-built tool that must be created, debugged, and maintained.
They are flexible; the same parser can be used for simple variable assignment or a more complex script.
They can share constants between code and data (more on this later).
Unfortunately, text data has a few drawbacks as well:
Unlike most binary formats, text must first be tokenized and interpreted, slowing the loading process.
Stored text is not space efficient; it wastes disk space and slows file loading.
Because many game parameters only need to be tweaked during development, it may be practical to use a text-based format during development, and then switch to a more optimized binary format for use in the shipping product. This provides the best of both worlds: the ease of use of text files, and the loading speed of binary data. We'll discuss a method for compiling text files into a binary format later in the gem.

The Parsing System


Here's what our parser will support:
Native support for basic data types: keywords, operators, variables, strings, integers, floats, bools, and GUIDs
Unlimited user-definable keyword and operator recognition
Support for both C (block) and C++ (single-line) style comments
Compiled binary read and write ability
Debugging support, able to point back to a source file and line number in case of error
#include file preprocessing support
#define support for macro substitution
Most of the preceding items are self-explanatory, but #include files and #define support may seem a bit out of place when discussing a text parser. We'll discuss how these features can greatly simplify scripts, as well as provide an additional mechanism to prevent scripts and code from getting out of sync.

Macros, Headers, and Preprocessing Magic


Preprocessing data files in the same manner as C or C++ code can have some wonderful benefits. The concept is perhaps best explained by a simple example. Let's assume that we wish to create a number of unique objects using a script file, which will provide the necessary data to properly initialize each object and create unique handles for use in code. Here's what such a script might look like:
CreateFoo(1) { Data = 10 }
CreateFoo(2) { Data = 20 }
CreateFoo(3) { Data = 30 }
CreateBar(4) { Foo = 1 }

Assuming that the CreateFoo() keyword triggers the creation of a Foo object in code, we now have three Foo objects in memory, each with unique member data, created by a script. Also, assuming that we're referencing these objects with handles, we can now access these objects in code with the values of 1, 2, and 3 as unique handles. Note that in our example, the script can also use these numeric handles. The Bar class requires a valid Foo object as a data member, and so we use a reference to the first Foo object created when creating our first Bar object. It is easy to lose track of the various handle values after creating several hundred of them. Any time an object is added in the script, the programmer must change the same values in code. There are no safeguards to prevent the programmer from accidentally referencing the wrong script object. This problem has already been solved in C and C++ through the use of header files in which variables and other common elements can be defined for many source files to share. If we think of the text script as simply another source file, the advantages of a C-like preprocessor quickly become apparent. Let's look again at our example using a header file instead of magic numbers.
// Header File
// ObjHandles.h
// Define all our object handles
#define SmallFoo 1
#define MediumFoo 2
#define LargeFoo 3
#define SmallBar 4
#define FooTypeX 10
#define FooTypeY 20
#define FooTypeZ 30

// Script File
// Directs the parser to scan the header file
#include "ObjHandles.h"

CreateFoo(SmallFoo) { Data = FooTypeX }
CreateFoo(MediumFoo) { Data = FooTypeY }
CreateFoo(LargeFoo) { Data = FooTypeZ }
CreateBar(SmallBar) { Foo = SmallFoo }

In addition to this being much easier to read and understand without the magic numbers, both the text script and source code share the same header file, so it's impossible for them to get out of sync. Because we're already performing a simple preprocessing substitution with #define, it's just one more step to actually parse and use more complex macros. By recognizing generic argument-based macros, we can now make complex script operations simpler by substituting arguments. Macros are also handy to use for another reason. Because macros are not compiled in code unless they are actually used (like a primitive form of templates), we can create custom script-based macros without breaking C++ compatibility in the header file. Note that although we're processing macros and #defines, the parser does not recognize other commands such as #ifdef, #ifndef, and #endif.
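The simple, argument-free case of #define substitution can be sketched in a few lines. This is an illustration only and not the parser's ProcessMacros() implementation: tokens are modeled as plain strings, and CollectDefines and Substitute are hypothetical helpers. Argument-based macros, which the gem's parser also supports, are not handled here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Collects "#define NAME VALUE" triples from an already tokenized header.
std::map<std::string, std::string> CollectDefines(const Tokens& header)
{
    std::map<std::string, std::string> defines;
    for (std::size_t i = 0; i + 2 < header.size(); ++i) {
        if (header[i] == "#define") {
            assert(!header[i + 1].empty());   // a macro needs a name
            defines[header[i + 1]] = header[i + 2];
            i += 2;                           // skip the name and value
        }
    }
    return defines;
}

// Replaces every script token that names a macro with the macro's value;
// all other tokens are copied through unchanged.
Tokens Substitute(const Tokens& script,
                  const std::map<std::string, std::string>& defines)
{
    Tokens out;
    out.reserve(script.size());
    for (const std::string& tok : script) {
        const auto it = defines.find(tok);
        out.push_back(it != defines.end() ? it->second : tok);
    }
    return out;
}
```

Run on the ObjHandles.h example, this turns CreateFoo(SmallFoo) { Data = FooTypeX } back into the numeric form CreateFoo(1) { Data = 10 }, while the script author only ever sees the names.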

The Parsing System Explained


There are five classes in our parsing system: Parser, Token, TokenList, TokenFile, and Macro. The Macro class is a helper class used internally in Parser, so we only need to worry about it in regard to how it's used inside Parser. TokenFile is an optional class used to read and write binary tokens to and from a standard token list. This leaves the heart of the parsing system: Parser, Token, and TokenList. Because Token is the basic building block produced by the parser, let's examine it first.
The Token Class

The basic data type of the parsing system is the Token class. There are eight possible data types represented by the class: keywords, operators, variables, strings, integers, real numbers, Booleans, and GUIDs. Keywords, operators, variables, and strings are all represented by C-strings, and so the only real difference among them is semantic. Integers, real numbers, and Booleans are represented by signed integers, doubles, and bools. For most purposes, this should be sufficient for data representation. GUIDs, or Globally Unique IDentifiers, are also given native data type status, because it's often handy to have a data type that is guaranteed unique, such as for identifying classes to create from scripts.


The Token class consists of a type field and a union of several different data types. A single class represents all basic data types. Data is accessed by first checking what type of token is being dealt with, and then calling the appropriate Get() function. Asserts ensure that inappropriate data access is not attempted. Each of the data types has a role to play in the parser, and it's important to understand how they work so that script errors are avoided. In general, the type definitions match similar definitions in C++. All keywords and tokens are case sensitive.
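The type-field-plus-union layout with asserting accessors might look like the sketch below. It is a simplified illustration rather than the Token class on the CD: the accessor and factory names are hypothetical, text is held in a std::string beside the union instead of a C-string inside it, and GUIDs are left out.

```cpp
#include <cassert>
#include <string>

class Token {
public:
    enum class Type { Keyword, Operator, Variable, String, Integer, Real, Boolean };

    // Keywords, operators, variables, and strings all carry text; only the
    // type tag distinguishes them.
    static Token MakeText(Type type, const std::string& text)
    {
        Token t(type);
        assert(t.IsText());
        t.text_ = text;
        return t;
    }
    static Token MakeInteger(int value)  { Token t(Type::Integer); t.value_.i = value; return t; }
    static Token MakeReal(double value)  { Token t(Type::Real);    t.value_.d = value; return t; }
    static Token MakeBoolean(bool value) { Token t(Type::Boolean); t.value_.b = value; return t; }

    Type GetType() const { return type_; }

    // Each accessor asserts that the token really holds that kind of data.
    const std::string& GetText() const { assert(IsText()); return text_; }
    int    GetInteger() const { assert(type_ == Type::Integer); return value_.i; }
    double GetReal()    const { assert(type_ == Type::Real);    return value_.d; }
    bool   GetBoolean() const { assert(type_ == Type::Boolean); return value_.b; }

private:
    explicit Token(Type type) : type_(type) { value_.d = 0.0; }
    bool IsText() const
    {
        return type_ != Type::Integer && type_ != Type::Real && type_ != Type::Boolean;
    }

    Type type_;
    union { int i; double d; bool b; } value_;  // numeric payload
    std::string text_;                          // text payload
};
```

Calling code checks GetType() first and then calls the matching accessor; in a debug build, asking an integer token for its text stops at the assert instead of returning garbage.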
Keyword

Keywords are specially defined words that are stored in the parser. Two predefined keywords are include and define. User-defined keywords are used primarily to aid in lexicographical analysis of the tokens after the scanning phase.

Operator
An operator is usually a one- or two-character symbol such as an assignment operator or a comma. Operators are unique in the fact that they act like white space regarding their ability to separate other data types. Because of this, operators always have the highest priority in the scanning routines, meaning that the symbols used in operators cannot be used as part of a keyword or variable name. Thus, using any number or character as part of an operator should be avoided. Operators in this parsing system also have an additional restriction: because of the searching method used, any operator that is larger than a single character must be composed of smaller operators. The larger symbol will always take precedence over the smaller symbols when they are not separated by white space or other tokens.
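The rule that a larger operator takes precedence over the smaller operators it is built from is a longest-match rule. The sketch below shows one straightforward way to implement longest match; it is not the searching method used by the parser on the CD (which the text says imposes the composed-of-smaller-operators restriction), and MatchOperator is a hypothetical function name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the longest reserved operator that begins at text[pos], or an
// empty string if no operator starts there.
std::string MatchOperator(const std::string& text, std::size_t pos,
                          const std::vector<std::string>& operators)
{
    assert(pos <= text.size());
    std::string best;
    for (const std::string& op : operators) {
        // compare() clips the substring at the end of text, so an operator
        // that would run past the end simply fails to match.
        if (op.size() > best.size() && text.compare(pos, op.size(), op) == 0)
            best = op;
    }
    return best;
}
```

With "=" and "==" both reserved, scanning "a==b" at position 1 yields "==" rather than two separate "=" tokens, while "a=b" still yields "=".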
Variable

A variable is any character-based token that was not found in the keyword list.

String
A string must be surrounded by double quotes. This parser supports strings of lengths up to 1024 characters (this buffer constant is adjustable in the parser) and does not support multiple-line strings.

Integers
The parser recognizes both positive and negative numbers and stores them in a signed integer value. It also recognizes hexadecimal numbers by the 0x prefix. No range checking is performed.

Floats
Floating-point numbers are called floats and are represented by a double value. The parser will recognize any number with a decimal point as a float. It will not recognize scientific notation, and no range checking is performed on the floating-point number.
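The integer and float rules above (optional minus sign, 0x prefix for hexadecimal, a decimal point marks a float, no scientific notation) can be captured in a small classifier. This is a sketch under those stated rules, not the scanner from the CD; NumberKind, ClassifyNumber, and ParseInteger are hypothetical names, and accepting only a leading minus sign (not a plus sign) is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

enum class NumberKind { NotANumber, Integer, HexInteger, Float };

// Decides how a token's text should be interpreted.
NumberKind ClassifyNumber(const std::string& text)
{
    std::size_t i = 0;
    if (i < text.size() && text[i] == '-') ++i;       // optional sign
    if (i == text.size()) return NumberKind::NotANumber;

    // Hexadecimal: "0x" followed by at least one hex digit.
    if (text.size() - i > 2 && text[i] == '0' &&
        (text[i + 1] == 'x' || text[i + 1] == 'X')) {
        for (std::size_t j = i + 2; j < text.size(); ++j)
            if (!std::isxdigit(static_cast<unsigned char>(text[j])))
                return NumberKind::NotANumber;
        return NumberKind::HexInteger;
    }

    bool sawDigit = false;
    bool sawPoint = false;
    for (; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isdigit(c)) sawDigit = true;
        else if (c == '.' && !sawPoint) sawPoint = true;
        else return NumberKind::NotANumber;   // also rejects 'e' exponents
    }
    if (!sawDigit) return NumberKind::NotANumber;
    return sawPoint ? NumberKind::Float : NumberKind::Integer;
}

// Converts text already known to be an integer. As in the gem, no range
// checking is performed.
long ParseInteger(const std::string& text)
{
    const NumberKind kind = ClassifyNumber(text);
    assert(kind == NumberKind::Integer || kind == NumberKind::HexInteger);
    return std::strtol(text.c_str(), nullptr,
                       kind == NumberKind::HexInteger ? 16 : 10);
}
```

A token such as 1e5 is classified as NotANumber here, so the scanner would go on to treat it as something else; that mirrors the statement that scientific notation is not recognized.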

Booleans
Boolean values are represented as a native C++ bool type, and true and false are built-in keywords. As with C++, these values are case sensitive.

GUIDs

By making use of the macro-expansion code, we can support GUIDs without too much extra work. Note that unless the macro is expanded with ProcessMacros(), the GUID will remain a series of separate primitive types. This function is described later.
The TokenList Class

The TokenList class is publicly derived from a standard STL list of Tokens. It acts exactly like a standard STL list of tokens, and has a couple of additional features. The TokenList class allows viewing of the file and line number that any given token comes from. This is exclusively an aid for debugging, and can be removed with a compile-time flag.
The Parser Class

This is the heart of the parsing functionality. We first create a parser object and call the Create() function. Note that all functions return a bool value, using true for success and false for failure. Next, we must reserve any additional operators or keywords beyond the defaults required for the text parsing. After this comes the actual parsing. The parsing phase is done in three passes, handled by three functions. Splitting the functionality up gives the user more control over the parsing process. Often, for simple parsing jobs, #include file processing and macro substitution are not needed. The first pass reads the files and translates the text directly into a TokenList using the function ProcessSource(). The next function, ProcessHeaders(), looks for any header files embedded in the source, and then parses and substitutes the contents of those headers into the original source. The third function, ProcessMacros(), performs both simple and complex C-style macro substitution. This can be a very powerful feature, and is especially useful for scripting languages. Let's see what this whole process looks like. Note that for clarity and brevity's sake, we are not doing any error checking.
/ / W e need a Parser and TokenList object to start TokenList toklist; Parser parser; // Create the parser and reserve some more keywords and tokens parser.Create();

parser.ReserveKeyword("special_keyword"); parser.ReserveOperator("["); parser.ReserveOperator("]"); // Now parse the file, any includes, and process macros parser.ProcessSource("data\scripts\somescript.txt", &toklist); parser.ProcessHeaders(&toklist); parser.ProcessMacros(&toklist);


The TokenFile Class

Because parsing and processing human-readable text files can be a bit slow, it may be necessary to use a more efficient file format in the shipping code. The TokenFile class can convert processed token lists into a binary form. This avoids having to parse the text file multiple times, doing #include searches, macro substitutions, and so forth. Character-based values, such as keywords, operators, and variables, are stored in a lookup table. All numeric values are stored in binary form, providing additional space and efficiency savings. In general, this binary form can be expected to load five to ten times as fast as the text-based form.

Using the TokenFile class is simple as well. The Write() function takes a TokenList object as an argument, and creates the binary form using either the output stream or filename that was specified. The class can also store the file in either a case-sensitive or case-insensitive manner. If both the variable "Foo" and "foo" appear in the script, turning the case sensitivity off will merge them together in the binary format, providing further space savings. Case sensitivity defaults to off. Reading the file is performed with the Read() function. Here's how it looks in code:

TokenFile tf;

// Write a file to disk
tf.Write("somefile.pcs", &toklist);

// Or read it
tf.Read("somefile.pcs", &toklist);
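The lookup table mentioned above can be pictured with a small sketch. This is not the TokenFile implementation from the CD; StringTable and Intern are hypothetical names, and the sketch only illustrates how storing each string once, with optional case folding, merges "Foo" and "foo" into a single entry.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical string table: each distinct string is stored once and
// referred to by index, as a binary token file might do.
class StringTable
{
public:
    explicit StringTable( bool caseSensitive ) : m_caseSensitive( caseSensitive ) {}

    // Returns the index of the string, adding it if it is new.
    std::size_t Intern( const std::string& text )
    {
        const std::string key = m_caseSensitive ? text : ToLower( text );
        std::map<std::string, std::size_t>::const_iterator it = m_index.find( key );
        if( it != m_index.end() )
            return it->second;
        const std::size_t id = m_strings.size();
        m_strings.push_back( text );   // keep the first spelling seen
        m_index[key] = id;
        return id;
    }

    std::size_t Size() const { return m_strings.size(); }
    const std::string& Get( std::size_t id ) const { return m_strings.at( id ); }

private:
    static std::string ToLower( std::string s )
    {
        for( std::size_t i = 0; i < s.size(); ++i )
            s[i] = static_cast<char>( std::tolower( static_cast<unsigned char>( s[i] ) ) );
        return s;
    }

    bool m_caseSensitive;
    std::vector<std::string> m_strings;
    std::map<std::string, std::size_t> m_index;
};
```

With case sensitivity off, Intern("Foo") and Intern("foo") return the same index, so the table holds one entry instead of two.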

Wrapping Up
Text file processing at its simplest level is a trivial problem requiring only a few lines of code. For anything more complex than this, however, it's beneficial to have a comprehensive text-parsing system that can be as flexible and robust as the job demands.

1.18
A Generic Tweaker
Lasse Staff Jensen, Funcom
[email protected]

During game development, one of the most frequent tasks we perform is tweaking variables until the game has just the right balance we need. In this gem, we will cover an easy-to-use "tweaker" interface and the design issues behind the implementation.

Requirements Analysis
One of the primary goals of a generic tweaker interface is to make it as transparent and easy to use as possible. The user in this case is the programmer who exposes variables to be tweaked. Further requirements to emphasise are the size in memory, the ability to tweak a variable without too much added overhead, and the speed of actually tweaking a variable (because in some cases the tweaker will be used in the release build as well). Let's try to break down the requirements in more detail, and see what the implementation really needs to do:

It should be transparent to the coder, meaning that the variables we want to tweak shouldn't contain any additional data and/or functionality, and that the usage of these variables shouldn't need to know about the tweaker at all.

It should be simple to use, meaning that the user should be able to define variables to be tweaked in less than 10 lines of code, and be able to tweak and get variables from a common database in typically two or three lines of code.

Implementation Design
Figure 1.18.1 contains the UML diagram of the classes to be presented in a bottom-up fashion in the rest of this gem. The type information and the tweakable hierarchy are the essence of this design.

[UML class diagram. Recoverable class names: Tweaker_c, TweakerInstanceDB_c, TweakableBase_c, TweakableTypeRange_c, TypeID_c, IntTypeID_c, FloatTypeID_c, BoolTypeID_c.]

FIGURE 1.18.1 Overview of the tweaker classes.

Type Information

We will use template specialization to provide type information that we can store in a uniform way. First is our base class TypeID_c, which defines the interface for our type information class with a virtual function that returns a string with the type name:

class TypeID_c
{
public:
    virtual const char* GetTypeName() const { return "Unknown"; }
};

Next, we create a template class that we can use to retrieve the correct type when given the variable. In this class, we add a member to get the pointer to our TypeID_c instance that can be tested directly for the stored pointer address.

template <class T> class Identifier_c
{
public:
    static const TypeID_c* const GetType();
};

Now that we have this class declared, we will use template specialization to define each unique type. Each subclass of TypeID_c will exist as a singleton, and the pointer to that instance serves as the identifier of the type. For simplicity, all of these will be placed in the global scope through static members. We can make sure that the actual instances exist, if called from other static functions, by receiving the pointer from the GetIdentification method. The full implementation for float values follows:

class floatID_c : public TypeID_c
{
public:
    virtual const char* GetTypeName() const { return "float"; }
    static TypeID_c* const GetIdentification();
};

template <> class Identifier_c<float>
{
public:
    static const TypeID_c* const GetType()
    {
        return floatID_c::GetIdentification();
    }
};

TypeID_c* const floatID_c::GetIdentification()
{
    static floatID_c cInstance;
    return &cInstance;
}

To use these classes for type information, we can simply store the base pointer:

float vMyFloat;
const TypeID_c* const pcType = TweakableBase_c::GetTypeID( vMyFloat );


Here, the TweakableBase_c (more on this class later) has a template member that calls the correct Identifier_c specialization. Then we can test the address of the pointer:

if( Identifier_c<float>::GetType() == pcType )
{
    // We have a float!
}

There are two macros for defining user types in the code on the accompanying CD, so all that's required for support of a new data type is to place a call to DECLARE_DATA_TYPE in the header and DEFINE_DATA_TYPE in the implementation file, and then recompile. (In addition, one might want to add a call to the macro DUMMY_OPERATORS() in case one doesn't want to support range checking.)

TweakableBase_c

We have a clean and easy way to store type info, so let's continue by creating the base class to hold the pointer to the tweakable variable. This class also contains the template member for getting the type info mentioned earlier. Because one of our requirements is to keep memory overhead to a minimum, we will use RTTI for checking which specific types of tweakables we have stored in memory. We therefore make sure the class is polymorphic by adding a virtual function to get the type info stored (or NULL if none). Here is the implementation:

class TweakableBase_c
{
public:
    TweakableBase_c( void* i_pData ) : m_pData( i_pData ) {;}
    ~TweakableBase_c() { /*NOP*/; }
    virtual const TypeID_c* const GetStoredType() const { return NULL; }
    template <class T>
    static const TypeID_c* const GetTypeID( const T& i_cValue )
    {
        return Identifier_c<T>::GetType();
    }
protected:
    void* m_pData;
}; // TweakableBase_c

Now that we have the base class, we can create subclasses containing additional data such as type information, limits for range checking, a pointer for a call-back function, and any other data we might need to attach to the various tweakables, while keeping the memory to a minimum. Here is how one of the specific tweakable classes looks:

template <class T>
class TweakableType_c : public TweakableBase_c
{
public:
    TweakableType_c( T* i_pxData, const TypeID_c* i_pcType )
        : TweakableBase_c( reinterpret_cast<void*>( i_pxData ) ),
          m_pcType( i_pcType ) { /*NOP*/; }
    const TypeID_c* const GetDataType() const { return m_pcType; }
    virtual const TypeID_c* const GetStoredType() const { return m_pcType; }
private:
    const TypeID_c* const m_pcType;
}; // TweakableType_c

The great thing about this code is that the subclasses are implemented as templates, even though the base class was defined without them. This way, we can pass in the pointer to the actual data type, hiding the casting to void from the interface.

Tweaker_c

We finally have all the building blocks we need to create the tweaker class itself. This class will store all of our tweakables and give the user functionality for tweaking the stored values. We will use an STL map to hold all of the pointers to our tweakables, using the name of each tweakable as the key. Simple template members provide all the functionality. An example of this is the TweakValue member:

template<class Value_x>
TweakError_e TweakValue( const std::string& i_cID, const Value_x& i_xValue )
{
    TweakableBase_c* pcTweakable;
    iTweakableMap_t iSearchResult = m_cTweakable_map.find( i_cID );
    if( iSearchResult == m_cTweakable_map.end() ) { return e_UNKNOWN_KEY; }
    pcTweakable = (*iSearchResult).second;
#ifdef _DEBUG
    TweakableType_c<Value_x>* pcType;
    if( pcType = dynamic_cast< TweakableType_c<Value_x>* >( pcTweakable ) ) {
        assert( pcTweakable->GetTypeID( i_xValue ) == pcType->GetDataType() );
    }
#endif
    TweakableTypeRange_c<Value_x>* pcTypeRange;
    if( pcTypeRange = dynamic_cast< TweakableTypeRange_c<Value_x>* >( pcTweakable ) ) {
        assert( pcTweakable->GetTypeID( i_xValue ) == pcTypeRange->GetDataType() );
        if( i_xValue < pcTypeRange->GetMin() ) { return e_MIN_EXCEEDED; }
        if( i_xValue > pcTypeRange->GetMax() ) { return e_MAX_EXCEEDED; }
    }
    *(reinterpret_cast<Value_x*>( pcTweakable->m_pData ) ) = i_xValue;
    return e_OK;
} // TweakValue

Because the member is a template, we can cast back to the given value directly, thereby completely hiding the ugly void casting. Note that if users decide to not store the type information, they could easily force us to do something bad, since we have no way of checking the origin of the reinterpret_cast.

TweakerInstanceDB_c

In order to support grouping of tweakables and the ability to store several instances of a given variable, we have an instance database to hold different tweakers. The implementation is straightforward: an STL multimap holding all of the instances of different tweakers, and an STL map of these multimaps where the category is the key.
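The container layout described here can be sketched in a few lines. This is an assumption-laden illustration, not the TweakerInstanceDB_c from the CD: SketchTweaker and SketchInstanceDB are hypothetical names, the instance ID is a plain integer, and the tweakers are stored by value so the sketch needs no manual memory management.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a tweaker; only the instance ID matters here.
struct SketchTweaker
{
    explicit SketchTweaker( std::size_t id ) : m_instanceID( id ) {}
    std::size_t m_instanceID;
};

class SketchInstanceDB
{
public:
    // name -> all instances registered under that name
    typedef std::multimap<std::string, SketchTweaker> InstanceMap_t;
    // category -> instance map
    typedef std::map<std::string, InstanceMap_t> CategoryMap_t;

    SketchTweaker* AddTweaker( const std::string& name, std::size_t instanceID,
                               const std::string& category )
    {
        InstanceMap_t& instances = m_categories[category];
        InstanceMap_t::iterator it =
            instances.insert( std::make_pair( name, SketchTweaker( instanceID ) ) );
        return &it->second;   // multimap elements keep a stable address
    }

    // Returns a null pointer if the category, name, or instance is unknown.
    SketchTweaker* GetTweaker( const std::string& category, const std::string& name,
                               std::size_t instanceID )
    {
        CategoryMap_t::iterator cat = m_categories.find( category );
        if( cat == m_categories.end() )
            return nullptr;
        std::pair<InstanceMap_t::iterator, InstanceMap_t::iterator> range =
            cat->second.equal_range( name );
        for( InstanceMap_t::iterator it = range.first; it != range.second; ++it )
        {
            if( it->second.m_instanceID == instanceID )
                return &it->second;
        }
        return nullptr;
    }

private:
    CategoryMap_t m_categories;
};
```

The multimap is what allows several instances to share one name, while the outer map gives the GUI a cheap way to list everything under a category.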

Let's test our implementation against the requirements to verify that we have reached our goals. Defining a variable to be tweakable requires us to create a tweaker and add it to the tweakable instance database.
Tweaker_c* pcTweaker = TweakerInstanceDB_c::AddTweaker( "Landscape", TWEAKER_CREATE_ID( this ), "Graphics" );

Here we create a tweaker for the class Landscape (inside the constructor, for example) and put it in the Graphics category. The TWEAKER_CREATE_ID macro takes the this pointer and makes sure that each instance of the class Landscape gets a unique ID. Then, we simply add each variable to this (and other tweakers we make) by:
pcTweaker->AddTweakable( &m_vShadowmapScaleTop, "Shadowmap scale", 0.0F, 68.0F );

Here we have added a variable, constrained it to the interval [0, 68], and called it "Shadowmap scale." It's vital to note that because of the template nature of the AddTweakable method, we must pass correct types to all of the arguments (for example, use 0.0F and not just 0). Defining a variable to be tweakable takes two lines of code, and is totally hidden from the users of the variable in question. For tweaking this variable, all we need is the name, data type, and desired instance. Usually, we have the pointer to the tweaker instance itself, but in the GUI code, one would typically do something like:

TweakerInstanceDB_c::iConstCategoryMap_t iCategory = TweakerInstanceDB_c::GetCategory( "Graphics" );
Tweaker_c* pcTweaker = GetTweaker( iCategory->second, "Landscape", TWEAKER_CREATE_ID( pcLandscape ) );

Here we first get all of the instance maps that are stored under the "Graphics" category label. Then we search for the correct instance of the Landscape class (we assume the pointer pcLandscape points to the instance in question). Changing the value of a specific variable is straightforward.

Tweaker_c::TweakError_e eError;
eError = pcTweaker->TweakValue( "Shadowmap scale", 20.0F );

So, tweaking a variable is one line of code, with additional lines for error handling (or simply asserting the return value). Receiving the stored value is done similarly:

float vShadowmapScale;
eError = pcTweaker->GetValue( "Shadowmap scale", &vShadowmapScale );

Graphical User Interface


GUIs tend to be specific to each project, so I have left a general discussion of this topic out of this gem, although I will describe an existing implementation as a source for ideas. In Equinox, Funcom's next-generation 3D engine, we have implemented a directory structure, as shown in Figure 1.18.2, that can be browsed in a console at the top of the screen. For tweaking values, we have defined input methods that can be assigned to the tweakables. That way, we can create specialized input methods such as the angle tweaker seen in Figure 1.18.3.

FIGURE 1.18.2 Screen shot from our GUI. The user can move up and down in the directories (categories in the code) and choose values to be tweaked.

FIGURE 1.18.3 This specialized input gives the user the possibility to visually tweak angles in an intuitive way.

For saving and loading, in addition to the binary snapshots, we can save all of the tweakables in #define statements directly into .h files. Because the number of instances of a variable could change over the lifetime of the application, we only save the first instance into the header file. This feature gives us the capability to add variables to be tweaked only in debug builds, and we then #include the header file to initialize the variables to the latest tweaked value in the release build. Here is a sample of how this works for our ShadowmapScale variable:
landscape_tweakables.h:

#define SHADOWMAP_SCALE 43.5

landscape.cpp:

#include "landscape_tweakables.h"

m_vShadowmapScale = SHADOWMAP_SCALE;

It is possible to use the RTTI typeid() to replace the type information code detailed previously. There are pros and cons to using our type information code.

Pros:
It takes up less space for the type information, since it is only required for classes that use it.
One can add specific information to the TypeID_c class; for example, a way to load and store the type or a pointer to the GUI control.

Cons:
We have to use macros for each unique type, while RTTI provides the type information automatically.


Acknowledgment
I would like to thank Robert Golias for invaluable help and suggestions, and for implementing the Equinox tweaker GUI that was an excellent test of how simple the interface actually turned out!

1.19
Genuine Random Number Generation
Pete Isensee, Microsoft
[email protected]

Computer games use random numbers extensively for rolling dice, shuffling cards, simulating nature, generating realistic physics, and performing secure multiplayer transactions. Computers are great at generating pseudo-random numbers, but not so good at creating genuine random numbers. Pseudo-random numbers are numbers that appear to be random, but are algorithmically computed based on the previous random number. Genuine, or real, random numbers are numbers that not only appear random, but are unpredictable, nonrepeating and nondeterministic. They are generated without the input of the previous random number. This gem presents a method of creating genuine random numbers in software.

Pseudo-Randomness
Pseudo-random number sequences eventually repeat themselves and can always be precisely reproduced given the same seed. This leads to distinct problems in gaming scenarios. Consider the common case of a game that initializes its random number generator (RNG) with the current tick count, the number of ticks since the machine was booted up. Now assume the player turns on their gaming console every time they begin playing this game. The level of randomness in the game is gated by the choice of seed, and the number of bits of randomness in the seed is unacceptably small.

Now consider the use of a pseudo-RNG to create secret keys for encrypting secure multiplayer game transmissions. At the core of all public key cryptographic systems is the generation of unpredictable random numbers. The use of pseudo-random numbers leads to false security, because a pseudo-random number is fully predictable (translate: easily hacked) if the initial state is known. It's not uncommon for the weakest part of crypto systems to be the secret key generation techniques [Kelsey98].
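The reproducibility problem is easy to demonstrate. The sketch below uses std::mt19937 from the C++ standard library purely as a representative pseudo-RNG; the gem does not name a particular generator, and FirstValues is a hypothetical helper.

```cpp
#include <random>
#include <vector>

// Returns the first 'count' outputs of a Mersenne Twister seeded with 'seed'.
std::vector<unsigned int> FirstValues( unsigned int seed, int count )
{
    std::mt19937 rng( seed );
    std::vector<unsigned int> values;
    for( int i = 0; i < count; ++i )
        values.push_back( static_cast<unsigned int>( rng() ) );
    return values;
}
```

Two calls with the same seed return identical sequences, so anyone who can guess the seed (for example, a tick count that falls in a narrow range after boot) can reproduce every "random" value the game will produce.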

Genuine Randomness
A genuine random number meets the following criteria: it appears random, has uniform distribution, is unpredictable, and is nonrepeating. The quality of unpredictability is paramount for security purposes. Even given full knowledge of the algorithm, an attacker should find it computationally infeasible to predict the output [Schneier96].

The ideal way of creating genuine random numbers is to use a physical source of randomness, such as radioactive decay or thermal noise. Many such devices exist; see [Walker(a)] for one example. However, PCs and video game consoles do not typically have access to these types of devices. In the absence of a hardware source, the technique recommended by RFC 1750 [Eastlake94] is "to obtain random input from a large number of uncorrelated sources and mix them with a strong mixing function." By taking input from many unrelated sources, each with a few bits of randomness, and thoroughly hashing and mashing them up, we get a value with a high degree of entropy: a truly random number.

Random Input Sources


Examples of random input available on many PCs and game consoles include:

System date and time
Time since boot at highest resolution available
Username or ID
Computer name or ID
State of CPU registers
State of system threads and processes
Contents of the stack
Mouse or joystick position
Timing between last N keystrokes or controller input
Last N keystroke or controller data
Memory status (bytes allocated, free, etc.)
Hard drive state (bytes available, used, etc.)
Last N system messages
GUI state (window positions, etc.)
Timing between last N network packets
Last N network packet data
Data stored at a semi-random address in main memory, video memory, etc.
Hardware identifiers: CPU ID, hard drive ID, BIOS ID, network card ID, video card ID, and sound card ID

Some of these sources will always be the same for a given system, like the user ID or hardware IDs. The reason to include these values is that they're variable across machines, so they're useful in generating secret keys for transmitting network data. Some sources change very little from sample to sample. For instance, the hard drive state and memory load may only change slightly from one read to the next. However, each input provides a few bits of randomness. Mixed together, they give many bits of randomness.


The more bits of entropy that can be obtained from input sources, the more random the output. It's useful to buffer sources such as mouse positions, keystrokes, and network packets over time in a circular queue. Then the entire queue can be used as an input source.
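The circular queue suggested here might look like the following sketch. SampleRing_c is a hypothetical name, not a class from the CD, and the sketch assumes the sample type is a simple, trivially copyable type such as an integer or a small struct of integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size ring buffer of raw input samples
// (mouse positions, key codes, packet timings, and so on).
template <class T>
class SampleRing_c
{
public:
    explicit SampleRing_c( std::size_t capacity )
        : m_data( capacity ), m_next( 0 ), m_count( 0 )
    {
        assert( capacity > 0 );
    }

    // Stores a sample, overwriting the oldest one once the ring is full.
    void Push( const T& sample )
    {
        m_data[m_next] = sample;
        m_next = ( m_next + 1 ) % m_data.size();
        if( m_count < m_data.size() )
            ++m_count;
    }

    // Appends the raw bytes of every stored sample to the input buffer.
    // The order of samples does not matter, since the buffer is hashed later.
    void AppendTo( std::vector<unsigned char>& buffer ) const
    {
        const unsigned char* bytes =
            reinterpret_cast<const unsigned char*>( m_data.data() );
        buffer.insert( buffer.end(), bytes, bytes + m_count * sizeof( T ) );
    }

    std::size_t Count() const { return m_count; }

private:
    std::vector<T> m_data;
    std::size_t m_next;
    std::size_t m_count;
};
```

The game would call Push from its input and network handlers, and the random number generator would call AppendTo whenever it builds its sample buffer.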

Hardware Sources
Some gaming platforms have access to physical sources of randomness. When these sources are available, they make excellent input sources. Examples of physical sources include:

Input from sound card (for example, the microphone jack) with no source plugged in
Input from a video camera
Disk drive seek time (hard drive, CD-ROM, DVD)
Intel 810 chipset hardware RNG (a thermal noise-based RNG implemented in silicon) [Intel99]

Mixing Function
In the context of creating genuine random numbers, a strong mixing function is a function where each bit of the output is a different complex and nonlinear function of each and every bit of the input. A good mixing function will change approximately half of the output bits given a single bit change in the input. Examples of strong mixing functions include:

DES (and most other symmetric ciphers)
Diffie-Hellman (and most other public key ciphers)
MD5, SHA-1 (and most other cryptographic hashes)

Secure hashing functions such as MD5 are the perfect mixers for many reasons: they meet the basic requirements of a good mixing function, they've been widely analyzed for security flaws, they're typically faster than either symmetric or asymmetric encryption, and they're not subject to any export restrictions. Public implementations are also widely available.
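The "half of the output bits change" property can be measured directly. The sketch below does so for a small 64-bit mixer (the SplitMix64 finalizer), chosen only because it fits in a few lines. It is a stand-in for illustration: it is not cryptographic and is not one of the functions listed above, so it should not be used to generate keys. Mix64, CountBits, and AverageFlippedBits are hypothetical names.

```cpp
#include <cstdint>

// Stand-in 64-bit mixer (the SplitMix64 finalizer). NOT cryptographic;
// used here only to illustrate how avalanche behavior is measured.
std::uint64_t Mix64( std::uint64_t z )
{
    z = ( z ^ ( z >> 30 ) ) * 0xBF58476D1CE4E5B9ULL;
    z = ( z ^ ( z >> 27 ) ) * 0x94D049BB133111EBULL;
    return z ^ ( z >> 31 );
}

// Number of bits set in v.
int CountBits( std::uint64_t v )
{
    int n = 0;
    while( v != 0 )
    {
        v &= v - 1;   // clear the lowest set bit
        ++n;
    }
    return n;
}

// Flips each of the 64 input bits in turn and returns the average number
// of output bits that change as a result.
double AverageFlippedBits( std::uint64_t input )
{
    const std::uint64_t base = Mix64( input );
    int total = 0;
    for( int bit = 0; bit < 64; ++bit )
        total += CountBits( base ^ Mix64( input ^ ( 1ULL << bit ) ) );
    return total / 64.0;
}
```

For a well-behaved mixer the average should sit near 32, which is half of the 64 output bits. The same measurement can be applied to the output of a real hash by comparing digests of inputs that differ in a single bit.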

Limitations
Unlike generating pseudo-random numbers, creating genuine random numbers in software is very slow. For the output to be truly random, many sources must be sampled. Some of the sampling is slow, such as reading from the hard drive or sound card. Furthermore, the sampled input must be mixed using complex algorithms. Game consoles have a more limited selection of input sources compared to PCs, so they will tend to produce less random results. However, newer consoles often have disk drives of some sort (CD-ROM, DVD, hard disk) that can be used as good hardware sources of entropy.


The randomness of the results depends solely on the level of entropy in the input samples. The more input samples and the more entropy in each sample, the better the output. Keep in mind that the more often this algorithm is invoked in quick succession, the less random the output, because the input bits change less between calls. To sum up, this technique is not a replacement for pseudo-RNG. Use this technique for the one-time generation of your RNG seed value or for generating network session keys that can then be used for hours or days.

Implementation
A C++ example of a genuine random number generator is provided on the accompanying CD. Any implementation of this algorithm will naturally be platform dependent. This particular version is specific to the Win32 platform, but is designed to be easily extensible to other platforms. It uses hardware sources of randomness, such as the Intel RNG and sound card input, when those sources are available. In the interests of efficiency and simplicity, it does not use all of the examples listed previously as input, but uses enough to produce a high level of randomness. The primary functionality resides in the GenRand object within the TrueRand namespace. Here is an example use of GenRand to create a genuine seed value:
#include "GenRand.h" // Genuine random number header
unsigned int nSeed = TrueRand::GenRand().GetRandInt();

Here's another example showing the generation of a session key for secure network communication. The Buffer object is a simple wrapper around std::basic_string<unsigned char>, which provides the functionality we need for reserving space, appending data, and tracking the size of the sample buffer:

TrueRand::GenRand randGen;
TrueRand::Buffer bufSessionKey = randGen.GetRand();

The GetRand() function is the heart of the program. It samples the random inputs, and then uses a strong mixing function to produce the output. This implementation uses MD5 hashing, so the resulting buffer is the length of an MD5 hash (16 bytes). The mCrypto object is a wrapper around the Win32 Crypto API, which includes MD5 hashing.

Buffer GenRand::GetRand()
{
    // Build sample buffer
    Buffer randInputs = GetRandomInputs();
    // Mix well and serve
    return mCrypto.GetHash( CALG_MD5, randInputs );
}


The GetRandomInputs() function is the input sampler. It returns a buffer with approximately 10K of sampled data. This function can easily be modified to include more or less input as desired. Because the time spent in the function varies according to system (drive, sound card) access, we can use the hardware latency as a source of random input; hence, the snapshot of the current time at the beginning and end of the function.

Buffer GenRand::GetRandomInputs()
{
    // For speed, preallocate input buffer
    Buffer randIn;
    randIn.reserve( GetMaxRandInputSize() );

    GetCurrTime( randIn );      // append time to buffer
    GetStackState( randIn );    // stack state
    GetHardwareRng( randIn );   // hardware RNG, if avail
    GetPendingMsgs( randIn );   // pending Win32 msgs
    GetMemoryStatus( randIn );  // memory load
    GetCurrMousePos( randIn );  // mouse position
    // . . . etc.

    GetCurrTime( randIn );      // random hardware latency
    return randIn;
}
Finally, here's one of the input sampling functions. It extracts the current time, and then appends the data to the mRandInputs buffer object. QueryPerformanceCounter() is the highest resolution timer in Windows, so it provides the most bits of randomness. We can ignore API failures in this case (and many others), because the worst that happens is that we append whatever random stack data happens to be in PerfCounter if the function fails.

void GenRand::GetCurrTime( Buffer& randIn )
{
    LARGE_INTEGER PerfCounter;
    QueryPerformanceCounter( &PerfCounter ); // Win32 API
    Append( randIn, PerfCounter );
}
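For readers who are not on Win32, the same sampling step can be sketched with the standard library. The assumptions are spelled out here: Buffer is a plain std::vector<unsigned char> rather than the CD's string-based wrapper, std::chrono::steady_clock stands in for QueryPerformanceCounter(), and Append is a guess at what the CD's helper does (copying the raw bytes of a value onto the end of the buffer).

```cpp
#include <chrono>
#include <vector>

// Assumption: a byte vector stands in for the CD's Buffer wrapper.
typedef std::vector<unsigned char> Buffer;

// Appends the raw bytes of any trivially copyable value to the buffer.
template <class T>
void Append( Buffer& buffer, const T& value )
{
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>( &value );
    buffer.insert( buffer.end(), bytes, bytes + sizeof( T ) );
}

// Portable stand-in for the Win32 version: samples a monotonic clock
// and appends its tick count.
void GetCurrTime( Buffer& randIn )
{
    const std::chrono::steady_clock::rep ticks =
        std::chrono::steady_clock::now().time_since_epoch().count();
    Append( randIn, ticks );
}
```

The resolution of steady_clock varies by platform, so the number of unpredictable bits gained per call will vary as well; only the low-order bits of the tick count can be expected to carry entropy.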

How Random Is GenRand?


There are many tests for examining the quality of random numbers. One test is the publicly available program ENT [Walker(b)], included on the accompanying CD, which applies a suite of tests to any data stream. Tests of GenRand() without using any sources of hardware input (including hard drive seek time), generating a file of 25,000 random integers using GetRandInt(), give the following results:

Entropy = 7.998199 bits per byte.
Optimum compression would reduce the size of this 100,000-byte file by 0 percent.
Chi square distribution for 100,000 samples is 250.13, and randomly would exceed this value 50 percent of the time.
Arithmetic mean value of data bytes is 127.4918 (127.5 = random).
Monte Carlo value for Pi is 3.157326293 (error 0.50 percent).
Serial correlation coefficient is 0.000272 (totally uncorrelated = 0.0).

These results indicate that the output has a high degree of randomness. For instance, the chi square test (the most common test for randomness [Knuth98]) indicates that we have a very random generator.
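Two of the figures above, entropy in bits per byte and the arithmetic mean, are simple to compute independently. The sketch below is not ENT's source code; it applies the standard Shannon entropy formula to a byte histogram, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0).
double EntropyBitsPerByte( const std::vector<unsigned char>& data )
{
    if( data.empty() )
        return 0.0;

    std::size_t counts[256] = { 0 };
    for( std::size_t i = 0; i < data.size(); ++i )
        ++counts[data[i]];

    double entropy = 0.0;
    for( int value = 0; value < 256; ++value )
    {
        if( counts[value] == 0 )
            continue;   // a value that never occurs contributes nothing
        const double p = static_cast<double>( counts[value] ) / data.size();
        entropy -= p * std::log2( p );
    }
    return entropy;
}

// Arithmetic mean of the byte values (127.5 for ideally random data).
double MeanByteValue( const std::vector<unsigned char>& data )
{
    if( data.empty() )
        return 0.0;
    double sum = 0.0;
    for( std::size_t i = 0; i < data.size(); ++i )
        sum += data[i];
    return sum / data.size();
}
```

A buffer containing every byte value equally often scores exactly 8 bits per byte, while a buffer of identical bytes scores 0, which gives a quick sanity check before trusting the numbers for real generator output.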

References
[Callas96] Callas, Jon, "Using and Creating Cryptographic-Quality Random Numbers," available online at www.merrymeet.com/jon/usingrandom.html, June 1996.
[Eastlake94] Eastlake, D., Network Working Group, et al, "Randomness Recommendations for Security," RFC 1750, available online at www.faqs.org/rfcs/rfc1750.html, December 1994.
[Kelsey98] Kelsey, J., et al, "Cryptanalytic Attacks on Pseudorandom Number Generators," available online at www.counterpane.com/pseudorandom_number.html, March 1998.
[Intel99] Intel Corporation, "Intel Random Number Generator," available online at http://developer.intel.com/design/security/rng/rng.htm, 1999.
[Knuth98] Knuth, Donald, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition. Addison-Wesley. 1998.
[Schneier96] Schneier, Bruce, Applied Cryptography, Second Edition. John Wiley & Sons. 1996.
[Walker(a)] Walker, John, "HotBits: Genuine Random Numbers Generated by Radioactive Decay," available online at www.fourmilab.ch/hotbits/.
[Walker(b)] Walker, John, "ENT: A Pseudorandom Number Sequence Test Program," available online at www.fourmilab.ch/random/.

1.20
Using Bloom Filters to Improve Computational Performance
Mark Fischer, Beach Software
[email protected]

Imagine the desire to store Boolean information in a bit array: a very simple premise. Simply assign each element in the bit array to a specific meaning, and then assign it a value. In this scenario, it takes 1 bit in the array to store 1 bit of stored information. The bit array faithfully represents its relative value with 100-percent accuracy. This, of course, works best when the stored data is array oriented such as a transient over time or space. However, what if the data is not a linear transient-oriented data set?

Bloom's Way
In 1970, Burton H. Bloom published a simple and clever algorithm [Bloom70] in the "Communications of the ACM." In his publication, Bloom suggests using a "Hash Coding with Allowable Errors" algorithm to help word processors perform capitalization or hyphenation on a document. This algorithm would use less space and be faster than a conventional one-to-one mapping algorithm. Using this example, a majority of words (90 percent, for example) could be checked using a simple rule, while the smaller minority set could be solved with an exception list used to catch the instances where the algorithm would report a word as simply solvable when it was not. Bloom's motivation was to reduce the time it took to look up data from a slow storage device.

Possible Scenarios
A Bloom Filter can reduce the time it takes to compute a relatively expensive and routinely executed computation by storing a true Boolean value from a previously executed computation. Consider the following cases where we'd like to improve performance:

Determine if a polygon is probably visible from an octree node.
Determine if an object probably collides at a coordinate.
Determine if a ray cast probably intersects an object at a coordinate.

All of these cases fit into a general scenario. Each case involves an expensive computation (CPU, network, or other resource) where the result is a Boolean (usually false) answer. It is important to note that the word probably is used in each case because a Bloom Filter is guaranteed to be 100-percent accurate if the Bloom Filter test returns a false (miss), but is, at best, only probably true if the Bloom Filter returns true (hit).

A Bloom Filter can store the true result of any function. Usually, the function parameter is represented as a pointer to a byte array. If we wish to store the result of a function that uses multiple parameters, we can concatenate the parameters into a single function parameter. In cases where 100-percent accuracy is needed, we must compute the original expensive function to determine the absolute result of the expensive function, if a Bloom Filter test returns true.

How It Works
There are two primary functions in a Bloom Filter: a function for storing the Boolean true value returned from an expensive function, and a function for testing for a previously stored Boolean true value. The storing function will accept input in any form and modify the Bloom Filter Array accordingly. The testing function will accept input in the same form as the storing function and return a Boolean value. If the testing function returns false, it is guaranteed that the input was never previously stored using the storing function. If the function returns true, it is likely that the input was previously stored using the storing function. A false positive is a possible result from the test. If 100-percent accuracy is desired, perform the original expensive function to determine the absolute value. A conventional Bloom Filter is additive, so it can only store additional Boolean true results from an expensive function and cannot remove previously stored values.

Definitions
The high-quality operation of a Bloom Filter requires a high-quality hash function that is sometimes referred to as a message digest algorithm. Any high-quality hash function will work, but I recommend using the MD5 message digest algorithm [RSA01] from RSA Security, Inc., which is available in source code on the Net, and is also documented in RFC 1321. The MD5 hash function takes N bytes from a byte array and produces a 16-byte (128-bit) return value. This return value is a hash of the input, which means if any of the bits in the input change (even in the slightest), the return value will be changed drastically. The return of the hash function, in Bloom terminology, is called the Bloom Filter Key. Bloom Filter Indexes are obtained by breaking the Bloom Filter Key into blocks of a designated bit size. If we choose a Bloom Filter Index bit size of 16 bits, a 128-bit Bloom Filter Key can be broken into eight complete 16-bit segments. If there are remaining bits left over from breaking the Key into complete segments, they are discarded.

1.20 Using Bloom Filters to Improve Computational Performance


The number of Bloom Filter Phases used in a Bloom Filter is the number of Bloom Filter Indexes used to store the Boolean value from the expensive function. For example, three phases might be used from a 128-bit key using a Bloom Filter Index bit size of 16 bits. The remaining five indexes will be discarded, in this example.

A Bloom Filter Array is used to store the expensive function's Boolean value. For example, if the Bloom Filter Index bit size is 16 bits, the Bloom Filter Array will be 2^16 bits long, or 64K bits (8K bytes). The larger the array, the more accurate the Bloom Filter test.

The Bloom Filter Saturation of the Bloom Filter Array is the percentage of bits set to true in the bit array. A Bloom Filter Array is optimal when saturation is 50 percent, or half of the bits are set and half are not.
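To make the splitting of a Bloom Filter Key into Bloom Filter Indexes concrete, here is a minimal sketch for the 16-bit case. The function name bloom_index16 is hypothetical, and reading the key most significant byte first is an assumption; any fixed byte order works as long as storing and testing agree on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the gem): returns the nPhase-th 16-bit
// Bloom Filter Index from a 16-byte (128-bit) Bloom Filter Key.
// Assumption: the key is read most significant byte first.
std::uint16_t bloom_index16(const std::uint8_t key[16], std::size_t nPhase)
{
    // A 128-bit key holds exactly eight complete 16-bit segments.
    assert(nPhase < 8);
    return static_cast<std::uint16_t>((key[2 * nPhase] << 8) | key[2 * nPhase + 1]);
}
```

With three phases, only bloom_index16(key, 0), bloom_index16(key, 1), and bloom_index16(key, 2) are used; the other five segments are ignored.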

Example 1
For an example, we will store the function parameter ("Mikano is in the park") using three phases with an index bit size of 16 bits into an array 64K bits long (8K bytes). In this instance, the expensive function was used to determine if Mikano was truly in the park and the result was yes (true). Although we used a string variable, in this case, any variable format will work. The format of the stored expensive function parameter data is independent of the Bloom Filter performance, accuracy, or memory usage.

First, the hash function is computed from the expensive function parameter data. Let's assume that the hash function returned the 128-bit value 0x10027AB30001BF7877AB34D976A09667. The first three segments of 16-bit indexes will be 0x1002, 0x7AB3, and 0x0001. The remaining segments are ignored.

The Bloom Filter Array starts out reset (all false bits), before we begin to populate the bit array with data. Then, for each of these indexes, we will set the respective bit index in the Bloom Filter Array to true regardless of its previous value. As the array becomes populated, sometimes we will set a bit to true that has already been set to true. This is the origin of the possible false positive result when testing the Bloom Filter Array (Figure 1.20.1).

When we wish to examine the Bloom Filter Array to determine if there was a previously stored expensive function parameter, we proceed in almost the same steps as a store, except that the bits are read from the Bloom Filter Array instead of written to it. If any of the read bits are false, then the expensive function parameter was absolutely never previously stored in the Bloom Filter Array. If all of the bits are true, then the expensive function parameter was likely previously stored in the Array. In the case of a true result, calculate the original expensive function to accurately determine the Boolean value (Figure 1.20.2).
Tuning the Bloom Filter
Tuning the Bloom Filter involves determining the number of phases and the bit size of the indexes. Both of these variables can be modified to change the accuracy and capacity of the Bloom Filter. Generally speaking, the larger the size of the bit array

[Figure: a three-phase, 16-bit (8K-byte) Bloom Filter bit array, with bit indexes running from 0x0000 to 0xFFFF. store_bloom_data("Mikano is in the park") hashes to a 128-bit Bloom Key divided into eight 16-bit segments (0x1002, 0x7AB3, 0x0001, 0xBF78, 0x77AB, 0x34D9, 0x76A0, 0x9667); only three phases are used, so bits 0x1002, 0x7AB3, and 0x0001 are written and the rest are ignored. test_bloom_data("Mikano is in the park") reads the same three bits and returns true. test_bloom_data("Mikano is in the office") hashes to segments 0xFFFF, 0x7AB3, 0xFFFC, 0x7063, 0x691E, 0xB269, 0x0110, 0xC001; bit 0x7AB3 is already set (a potential false positive), but one of the bits read is not set, so the test returns false. The figure notes that if 0xFFFC was also set, then a false positive would be returned.]

FIGURE 1.20.1 Flow of a Bloom Filter.

// returns a pointer to 16 bytes of data that represent the hash.
// compute_hash will always return the same 16 bytes of data when
// called with the same input parameters. Theoretically, a different
// input parameter could return the same value, but that is improbable.
// Either way, the algorithm will still work.
void * compute_hash( void * pData, int nDataLength );
// returns the integer value for the bits at nBitIndex for nBitLength long
int get_index_value( void * pData, int nBitIndex, int nBitLength );
// tests a bit in the Bloom Filter Array.
// returns true if set otherwise returns false
boolean is_bit_index_set( int nIndexValue );
// sets a bit in the Bloom Filter Array
void set_bit_index( int nIndexValue );

void store_bloom_data( void * pData, int nDataLength )
{
    void *pHash;
    int nPhases = 3, nPhaseIndex = 0, nBitIndexLength = 16;
    int nIndexValue;
    // returns pointer to 16 bytes of memory
    pHash = compute_hash( pData, nDataLength );
    // now set each bit
    while ( nPhaseIndex < nPhases )
    {
        nIndexValue = get_index_value( pHash, nPhaseIndex, nBitIndexLength );
        // set the bit regardless of its previous value
        set_bit_index( nIndexValue );
        nPhaseIndex++;
    }
}

boolean test_bloom_data( void * pData, int nDataLength )
{
    void *pHash;
    int nPhases = 3, nPhaseIndex = 0, nBitIndexLength = 16;
    int nIndexValue;
    // returns pointer to 16 bytes of memory
    pHash = compute_hash( pData, nDataLength );
    // now test each bit
    while ( nPhaseIndex < nPhases )
    {
        nIndexValue = get_index_value( pHash, nPhaseIndex, nBitIndexLength );
        // if bit is not set, we have a miss so return false.
        // Return false as soon as we find a false bit. At this point,
        // the expensive function has definitely not been previously stored.
        if ( !is_bit_index_set( nIndexValue ) )
            return( false );
        nPhaseIndex++;
    }
    // all bits are set so we have a probable hit.
    return( true );
}

FIGURE 1.20.2 Basic use of a Bloom Filter.
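The figure declares set_bit_index and is_bit_index_set without showing their bodies. One possible way to back them with an 8K-byte array is sketched below. The BloomBitArray type, the byte layout, and the saturation helper are assumptions made for illustration, not the gem's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical Bloom Filter Array for a 16-bit index: 2^16 bits = 8K bytes.
struct BloomBitArray
{
    std::vector<std::uint8_t> bytes;

    // The array starts out reset (all bits false).
    BloomBitArray() : bytes((1u << 16) / 8, 0) {}

    // Sets a bit to true regardless of its previous value.
    void set_bit_index(int nIndexValue)
    {
        assert(nIndexValue >= 0 && nIndexValue < (1 << 16));
        bytes[nIndexValue >> 3] |= static_cast<std::uint8_t>(1u << (nIndexValue & 7));
    }

    // Returns true if the bit is set, otherwise false.
    bool is_bit_index_set(int nIndexValue) const
    {
        assert(nIndexValue >= 0 && nIndexValue < (1 << 16));
        return ((bytes[nIndexValue >> 3] >> (nIndexValue & 7)) & 1u) != 0;
    }

    // Fraction of bits currently set to true (0.5 is the optimum named in the text).
    double saturation() const
    {
        std::size_t nSet = 0;
        for (std::uint8_t b : bytes)
            for (int k = 0; k < 8; ++k)
                nSet += (b >> k) & 1u;
        return static_cast<double>(nSet) / (bytes.size() * 8.0);
    }
};
```

Watching saturation() as data is stored gives a simple check on whether the array is sized sensibly for the number of stored inputs.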


(N) and the more phases, the less likely a false positive response will occur. Bloom asserted that the optimum performance of this algorithm occurs when saturation of the bit array is 50 percent. Statistically, the chance of a false positive can be determined by taking the array saturation and raising it to the power of the number of phases. Other equations are available to tune the Bloom filter algorithm. The equation to calculate the percentage of false positives is:

$$\text{percent\_false\_positive} = \text{saturation}^{\,\text{number\_of\_phases}}$$

or expressed as a function of percent_false_positive:

$$\text{number\_of\_phases} = \log_{\text{saturation}}(\text{percent\_false\_positive})$$

By assuming that the Bloom Filter Array is operating at optimum capacity of 50-percent saturation, Table 1.20.1 can be computed from the preceding formulas. For example, if we want the false positive rate below half a percent (0.5 percent), eight phases must be used, which will return a worst-case scenario of 0.39-percent false positives. Next, we calculate the Bloom Filter Array bit size.

$$\text{array\_bit\_size} = \frac{\text{number\_of\_phases} \times \text{max\_stored\_input}}{-\ln(0.5)}$$

The array_bit_size is usually rounded up to the nearest value where array_bit_size can be expressed as 2 to the power of an integer.

$$\text{array\_bit\_size} = 2^{x}$$

Finally, compute the index_bit_size from the array_bit_size.

$$\text{array\_bit\_size} = 2^{\,\text{index\_bit\_size}}$$

Table 1.20.1 Percentage of False Positives Based on Number of Phases Used

percent_false_positive    number_of_phases
50.00%                    1
25.00%                    2
12.50%                    3
6.25%                     4
3.13%                     5
1.56%                     6
0.78%                     7
0.39%                     8
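The tuning formulas are easy to evaluate in code. The following sketch computes the number of phases for a target false positive rate and the index bit size for a given capacity. The function names are hypothetical, and the small epsilon that guards against floating-point rounding is a choice made here, not part of the original formulas.

```cpp
#include <cassert>
#include <cmath>

// Smallest number of phases such that saturation^phases <= maxFalsePositive.
// Both arguments are fractions in (0, 1), for example 0.5 and 0.05.
int phases_for(double saturation, double maxFalsePositive)
{
    assert(saturation > 0.0 && saturation < 1.0);
    assert(maxFalsePositive > 0.0 && maxFalsePositive < 1.0);
    const double exact = std::log(maxFalsePositive) / std::log(saturation);
    // The epsilon keeps an exact power (such as 0.25 at 50-percent
    // saturation) from being rounded up by floating-point noise.
    return static_cast<int>(std::ceil(exact - 1e-9));
}

// Smallest index_bit_size such that
// 2^index_bit_size >= number_of_phases * max_stored_input / -ln(0.5).
int index_bit_size_for(int nPhases, int nMaxStoredInputs)
{
    assert(nPhases > 0 && nMaxStoredInputs > 0);
    const double bits = nPhases * static_cast<double>(nMaxStoredInputs) / -std::log(0.5);
    int n = 0;
    while (std::ldexp(1.0, n) < bits)
        ++n;
    return n;
}
```

For a 5-percent false positive limit at 50-percent saturation, phases_for gives five phases, and index_bit_size_for(5, 9000) gives 16 bits, which matches a 64K-bit array.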


Example 2
Suppose we want to store a maximum of 9000 expensive function parameters with at least 95-percent accuracy when the Bloom Filter Array test returns true. From Table 1.20.1, we can determine that five phases will be necessary to obtain an accuracy of equal to or greater than 95 percent and a false positive of less than or equal to 5 percent.

5 phases * 9000 expensive function parameters / -ln(0.5) = 64,921 bits

Rounding up to the nearest 2^n gives us 64K bits (8K bytes), and because 2^16 = 64K, the index_bit_size will be 16 bits.

Final Notes
One way to improve performance is to use an exception list to prevent executing the expensive function, as Bloom did in his algorithm. An exception list contains all of the false positive cases that can be returned from testing a Bloom Filter. This can be computed at parameter storage or dynamically when false positives are detected (Figure 1.20.3).

Another way to improve performance is to dynamically build a Bloom Filter Array. If the range of expensive function parameters is too great, Bloom Filters can be calculated dynamically and optimized for repetitive calls to test the bit array. By dynamically building a Bloom Filter Array, the commonly tested expensive function parameters are calculated once, and untested function parameters do not waste space in the bit array.
Standard Bloom Filter test code, with the optional code marked in the comments:

if ( test_bloom_data( c ) )
{
    boolean bSuccess = false;
    // Optional: exception list test
    if ( in_exception_list( c ) )
        return( bSuccess );
    bSuccess = expensive_function( c );
    // Optional: dynamically computed exception list
    // (the call is cut off after "add_to_excepti" in the figure)
    if ( !bSuccess )
        add_to_excepti
    return( bSuccess );
}
// Optional: dynamically computed Bloom Filter
else if ( expensive_function( c ) )
{
    store_bloom_data( c );
    return true;
}
return false;

FIGURE 1.20.3 Bloom Filter configurations.


Here are some interesting Bloom Filter characteristics:
• Two Bloom Filter Arrays can be merged together by bitwise ORing them.
• Bloom Filter Arrays can be shared among parallel clients.
• Optimized Bloom Filter Arrays are not compressible.
• Underpopulated Arrays are very compressible.
• Memory corruption in the array can be mended by setting unknown bits to true.
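The merge property in the list above takes only a few lines. This is a sketch with a hypothetical function name; it assumes both arrays were built with the same hash function, the same number of phases, and the same array size, since otherwise the merged bits would not mean the same thing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Merges the Bloom Filter Array src into dst by bitwise ORing them.
// Assumption: both arrays use the same hash, phase count, and size.
void merge_bloom_arrays(std::vector<std::uint8_t>& dst,
                        const std::vector<std::uint8_t>& src)
{
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] |= src[i];
}
```

After the merge, any input that tested true against either source array also tests true against dst, because a bit that was set in either array is set in the result.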

Conclusion
Bloom Filters offer a method of improving performance of repeatedly called expensive functions at the expense of memory. While this method has been documented for a long time, it remains a relatively unused technique, although exceptions exist, such as Bloom Filter usage in the very popular Web-caching program Squid (www.squid-cache.org/) by Duane Wessels. Adding a Bloom Filter algorithm to a program can usually be done in less than 20K bytes of code. As with most performance-enhancing tricks, it is a good idea to add Bloom Filters to a project during the optimization stage, after the main functionality is finished.

References
[Beach01] Beach Software, "Bloom Filters," available online at http://beachsoftware.com/bloom/, May 10, 2000.
[RSA01] RSA Security, "What Are MD2, MD4, and MD5," available online at www.rsasecurity.com/rsalabs/faq/3-6-6.html, March 4, 2001.
[Flipcode01] Flipcode, "Coding Bloom Filters," available online at www.flipcode.com/tutorials/tut_bloomfilter.shtml, September 11, 2000.
[Bloom70] Bloom, Burton H., "Space/Time Trade-Offs in Hash Coding with Allowable Errors," Communications of the ACM, Vol. 13, No. 7 (ACM July 1970): pp. 422-426.

1.21
3ds max Skin Exporter and Animation Toolkit
Marco Tombesi

We have seen wonderful special effects in modern films that have taken glorious monsters such as dinosaurs and made them move smoothly. We know how they did it (using software such as LightWave, 3ds max, Maya, etc.), but how do we use the same animation technology for our games? This gem is intended as an introduction to a full toolset for that purpose, starting just after the creation of the animated character in 3ds max (and Character Studio), and ending with that object smoothly animating in a game's real-time scenes. Along the way, it passes through the export plug-in and is stored in a custom data format. In this gem, we will go into depth only about the export aspect; the rest is well explained by the code on the accompanying CD. Let's talk about the necessary steps:

1. The animation is done with 3ds max 3.1 (hereafter simply called MAX) and Character Studio 2.2, using Biped and/or bones and the Physique modifier. It should be noted that although newer versions of these tools will become available, the algorithms required for any new versions should be similar.
2. The export plug-in creates a custom format file (.MRC), which consists of:
   • Mesh information (vertices, normals).
   • Skeletal structure (the bone tree).
   • Influence values (weighting) of each bone to vertices of the mesh (one vertex may be influenced by multiple bones).
   • Bone animation: For each bone, this consists of a set of translation and rotation keys (using quaternions), including the exact time in milliseconds from the animation start to when the transformation should be performed.
3. To read the .MRC file, we have a reusable DLL available, provided with full source code.
4. The renderer interpolates (linearly or better) between the sample keys and calculates the current transformation matrix to be applied to each bone. This is done using the time elapsed from the animation start, obtaining a smooth and non-hardware-dependent animation.
5. At each frame, the renderer recalculates the position of each vertex and its normal. The calculation is based on the current transformation matrix and influence value that each bone has on a particular vertex. Most matrix operations can be done using the graphics hardware transformation and lighting features if they exist (for example, on the GeForce and Radeon cards).

The process of exporting the animation data with a plug-in for MAX is not well documented. While there are many Web pages covering skinning techniques, few actually address the issue of exporting the data. Read and study the source code as well as all Readme.txt files in the project directories for this gem on the CD. More information is also available on the author's Web page [Tombesi01], where updates for MAX 4 will be available when it is released.

This gem is based on a hierarchical bone structure: a bone tree or a Biped, created using Character Studio 2.2. Build a low polygon mesh (about 5000 triangles). The mesh should be a single selectable object in MAX. Deform the mesh using the Physique modifier, based on the Biped previously created. The character animation should be created on the Biped.

Exporting
First, we need a file format specification.
The MRC File Format

This is a simple file format for the purposes of this gem. It supports normals, bones, vertex weights, and animation keys. See Figure 1.21.1 for a self-explanatory schematic, and check the code on the CD for technical clarification.
Exporting to MRC with the MAX SDK

If you are new to plug-in development and don't know how MAX works, be sure to refer to the MAX SDK documentation. In particular, study the following sections before proceeding:
• DLL, Library Functions, and Class Descriptors
• Fundamental Concepts of the MAX SDK
• Must Read Sections for All Developers
• Nodes
• Geometry Pipeline System
• Matrix Representations of 3D Transformations

[Figure: schematic of the MRC file layout between the file start and FILE END markers, showing the count and offset fields vertCnt, normCnt, faceCnt, boneOfs, childCnt, influencedVertexCnt, boneCnt, and keyCnt.]

FIGURE 1.21.1 MRC file format description.

Working with Nodes


In our export plug-in, we must derive a class from SceneExport and implement some virtual methods, one of which is the main export routine.
class MRCexport : public SceneExport
{
public:
    // Number of extensions supported
    int ExtCount() { return 1; }
    // Extension ("MRC")
    const TCHAR * Ext(int n) { return _T("MRC"); }

    // Export to an MRC file
    int DoExport( const TCHAR *name, ExpInterface *ei, Interface *i,
                  BOOL suppressPrompts=FALSE, DWORD options=0 );

    // Constructor/Destructor
    MRCexport();
    ~MRCexport();
};

Accessing scene data requires an Interface passed by MAX to the main export routine (the entry point of the plug-in). For every object in MAX, there is a node in the global scene graph, and each node has a parent (except RootNode) and possibly some children. We can access the root node and then traverse the hierarchy, or we can directly access a node if the user has selected it in MAX before exporting.
INode* pNode = i->GetSelNode(0);
INode* const pRoot = i->GetRootNode();

To navigate the node structure, we have these methods:


int count = pNode->NumberOfChildren();
INode* pChNode = pNode->GetChildNode(i);

A node could represent anything, so we need to discriminate among object types via the node's class identifier (Class_ID or SuperClassID), and then appropriately cast the object. For our purposes, we need to check if a node is a geometric object (a mesh) or a bone (a Biped node or a bone).

bool IsMesh(INode *pNode)
{
    if(pNode == NULL) return false;
    ObjectState os = pNode->EvalWorldState(0);
    if(os.obj->SuperClassID() == GEOMOBJECT_CLASS_ID)
        return true;
    return false;
}

bool IsBone(INode *pNode)
{
    if(pNode == NULL) return false;
    ObjectState os = pNode->EvalWorldState(0);
    if(!os.obj) return false;
    if(os.obj->ClassID() == Class_ID(BONE_CLASS_ID, 0))
        return true;
    if(os.obj->ClassID() == Class_ID(DUMMY_CLASS_ID, 0))
        return false;
    Control *cont = pNode->GetTMController();
    if( cont->ClassID() == BIPSLAVE_CONTROL_CLASS_ID ||  // other Biped parts
        cont->ClassID() == BIPBODY_CONTROL_CLASS_ID )    // Biped root "Bip01"
        return true;
    return false;
}

The previous example explains how to navigate MAX's nodes and check what they represent. Once we get a mesh node, we need to acquire the desired vertex data.
Getting Mesh Data

For convenience later on, we'll store all vertex data in global coordinate space. MAX object coordinates are in object space, so we need a transformation matrix to be applied to each vertex and normal of the mesh. We can grab this global transformation matrix at any time during the animation using GetObjectTM(TimeValue time). This matrix is used to transform vectors from object space to world space and could be used, for example, if we want to get the world space coordinate of one mesh vertex. We could do this by taking the vertex coordinate in object space and multiplying it by the matrix returned from this method, with the vertex on the left (in MAX, the matrix post-multiplies the vertex). We are interested in mesh data at the animation start, so TimeValue is zero.
Matrix3 tm = pNode->GetObjectTM(0);

MAX uses 1x3 row vectors and 4x3 matrices, so to transform a vector, we write the vector first and the matrix second (vector * matrix). Vertices and other data are not statically stored, but dynamically calculated each time. To access data, we must first perform the geometry pipeline evaluation, specifying the time at which we want to get the object state.


MAX has a modifier stack system, where every object is the result of a modification chain. Starting from a simple parametric primitive (such as a box) that is the base object, the final object is built, applying modifiers in sequence along the stack. This is the object pipeline and we will work with the result. The resulting object is a DerivedObject and has methods to navigate the stack of modifiers. To get the result at a specified animation time, we must first retrieve an ObjectState, which is done by invoking the method EvalWorldState on the node. This makes MAX apply each modifier in the pipeline from beginning to end.
ObjectState os = pNode->EvalWorldState(0);

ObjectState contains a pointer to the object in the pipeline and, once we have this object, we can finally get the mesh data. To do this, we must cast the generic object to a geometric one, which has a method to build a mesh representation.
Mesh& mesh = *(((GeomObject*)os.obj)->GetRenderMesh(0, pNode, ...));

Now it is easy to access the mesh members and finally put vertices, faces, and normals in memory, ready to be written to a file. These methods are available to accomplish this: Mesh::getNumVerts(), Mesh::getNumFaces(), Mesh::getVert(i), and Mesh::getNormal(i). Listing 1.21.1 illustrates how to export mesh data to a file.
Getting the Bone Structure

Now we need a way to write the skeleton's hierarchical structure to an output data file. Starting from the root node, we traverse depth-first through the tree, and for each bone, we need to get several things. First, we assign an index to any direct child and to the bone's parent, and then we grab the bone orientation matrix.
tm = pNode->GetNodeTM(0);
tm.Invert();
Although very similar, the preceding matrix isn't the object matrix, but is related to the node's pivot point, which may not be the object's origin. Check with the SDK documentation to find a precise description. We will use this matrix to transform every mesh vertex from world space to related bone space, so it can move with the bone. Since we have to multiply any vertex by the inverse of this matrix, we can invert it now and save rendering time.

Getting the Bone Influences

Now we are at the most exciting part of this gem: getting the vertex bone assignment and influence value (weighting). The weighting is important when two or more bones influence the same vertex and the mesh deformation depends on both (see [Woodland00] for the theory). These assignments should be done using the Physique modifier in Character Studio 2.2. Note to the reader: Study the Phyexp.h header that comes with Character Studio for modifier interface help.


First, we must find the Physique modifier on the object's node that we wish to export (this is the same node we used earlier to get the mesh vertex data). We do this by accessing the referenced DerivedObject and then scanning each applied modifier on the stack until we find the Physique modifier (using a Class_ID check).
Modifier* GetPhysiqueMod(INode *pNode)
{
    Object *pObj = pNode->GetObjectRef();
    if(!pObj) return NULL;
    // Is it a derived object?
    while(pObj->SuperClassID() == GEN_DERIVOB_CLASS_ID)
    {
        // Yes -> Cast
        IDerivedObject *pDerivedObj = static_cast<IDerivedObject*>(pObj);
        // Iterate over all entries of the modifier stack
        int ModStackIndex = 0;
        while(ModStackIndex < pDerivedObj->NumModifiers())
        {
            // Get current modifier
            Modifier* pMod = pDerivedObj->GetModifier(ModStackIndex);
            // Is this Physique?
            if(pMod->ClassID() ==
               Class_ID(PHYSIQUE_CLASS_ID_A, PHYSIQUE_CLASS_ID_B))
                return pMod;
            // Next modifier stack entry
            ModStackIndex++;
        }
        pObj = pDerivedObj->GetObjRef();
    }
    // Not found
    return NULL;
}

Now we enter the Bone assignment phase (see Listing 1.21.2; a code overview follows). Once we have the Physique modifier, we get its interface (IPhysiqueExport) and then access the Physique context interface (IPhyContextExport) for the object. This owns all of the methods with which we need to work. Each vertex affected by a modifier has an interface IPhyVertexExport. Grab this interface to access its methods, calling GetVertexInterface(i) on the Physique context interface. We must check to see if a vertex is influenced by one or more bones (RIGID_TYPE or RIGID_BLENDED_TYPE, respectively). In the former case, the weight value is 1 and we have to find just a single bone (calling GetNode on the i-th vertex interface). In the latter case, we have to find every bone assigned to the vertex, and for each bone we must get its proper weight value by invoking GetWeight(j) on the i-th vertex interface, where j is the j-th bone influencing it. In addition, note that at the end, we must remember to release every interface. Now we are ready for the last phase: bone animation data acquisition.
Getting Bone Animation Keys

This is a simple step. At selected time intervals (default 100 milliseconds), grab the transformation matrix of each bone. In the MAX SDK, time is measured internally in "ticks," where there are 4800 ticks per second, so we must perform a conversion. Then we use this method:
tm = pNode->GetNodeTM(timeTicks);
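The conversion from milliseconds to ticks mentioned above is a one-line calculation. The sketch below uses hypothetical names (kTicksPerSecond, ms_to_ticks) and plain integers rather than any MAX SDK type.

```cpp
#include <cassert>

// MAX measures time internally in ticks; there are 4800 ticks per second.
const int kTicksPerSecond = 4800;

// Converts a sample time in milliseconds to ticks.
// The intermediate product is widened so long animations do not overflow.
int ms_to_ticks(int nMilliseconds)
{
    assert(nMilliseconds >= 0);
    return static_cast<int>(static_cast<long long>(nMilliseconds) * kTicksPerSecond / 1000);
}
```

With the default sampling interval of 100 milliseconds, successive keys are 480 ticks apart.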

It's more efficient to not store the complete matrix (16 floats), but instead only the translation (3 floats) and rotation data (4 floats), so we extract a position vector and a unit quaternion from the matrix.
Point3 pos = tm.GetTrans();
Quat quat(tm);
Once we have all the data collected in memory, we store everything to disk using the MRC file format. Now it is time to see how to use it all to perform smooth animation in our games.
Put It to Use: The Drawing Loop

In our application, for each frame displayed, we should perform the following steps in sequence.
Get the Exact Time

To make the animation very smooth and not processor dependent, getting the system time is necessary. We update the skeleton structure by cycling through the bone tree and, for each bone, work out the current transformation matrix by linearly interpolating between two sample keys. To find out which sample keys to interpolate between, we require the current real animation time (in milliseconds) from animation start.
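Finding the two sample keys that enclose the current time, and the blend factor between them, might look like the following sketch. The KeySpan structure and both function names are hypothetical; the key times are assumed to be stored in strictly ascending order, in milliseconds from the animation start.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the two enclosing keys and the blend factor t in [0, 1].
struct KeySpan
{
    std::size_t a;
    std::size_t b;
    float t;
};

// Assumption: keyTimesMs is non-empty and strictly ascending.
KeySpan find_key_span(const std::vector<int>& keyTimesMs, int nNowMs)
{
    assert(!keyTimesMs.empty());
    const std::size_t last = keyTimesMs.size() - 1;
    // Clamp before the first key and after the last key.
    if (nNowMs <= keyTimesMs.front()) return {0, 0, 0.0f};
    if (nNowMs >= keyTimesMs.back())  return {last, last, 0.0f};

    // Find the first key at or after the current time.
    std::size_t b = 1;
    while (keyTimesMs[b] < nNowMs)
        ++b;
    const std::size_t a = b - 1;
    const float t = static_cast<float>(nNowMs - keyTimesMs[a]) /
                    static_cast<float>(keyTimesMs[b] - keyTimesMs[a]);
    return {a, b, t};
}

// Linear interpolation of one translation component.
float lerp(float x, float y, float t)
{
    return x + (y - x) * t;
}
```

The same factor t can drive the quaternion interpolation of the rotation keys, so each bone needs only one search per frame.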
Moving the Skeleton

We determine actual bone position and rotation by linear (or better) interpolation and by quaternion interpolation (SLERP or better) between selected sample keys (sample times should enclose the current time). Then, given these data, you can build the current bone animation matrix from the translation and rotation. The math involved, especially in the quaternion calculations, is explained well in the previous Game Programming Gems book [Shankel00]. To take better advantage of graphics hardware, we perform all matrix calculations using OpenGL functions. This way we can exploit any advanced hardware features such as transformation and lighting, and performance will be much better!
Recalculate the Skin

Once the skeleton is moved, it is time to deform the mesh accordingly, with respect to vertex weight assignments. See [Woodland00] for a good overview of this topic. It is convenient to check the vertices in bone-major order, traversing depth-first through the bone tree and doing the following passes for each bone. For each vertex influenced by the bone, we refer it to the bone's local coordinate system (multiplying by the bone inverse orientation matrix), and then transform it via the current bone animation matrix. Then, we multiply the vertex coordinates by the influence value (weight) this bone exerts on it. We add the result to the corresponding vertex value stored in a temporary buffer. Now this buffer contains the current vertex coordinates for the skin, at this point in the animation. To finish, we draw the computed mesh using vertex arrays (or better) to gain even more performance.
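The bone-major skinning pass described above can be sketched on the CPU as follows. All of the types here (Vec3, Mat43, Influence, Bone) are hypothetical stand-ins rather than the structures used by the code on the CD, and the bones are held in a flat list instead of a tree, which gives the same sums. The row-vector convention (vector * matrix) follows MAX.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Rows 0-2 hold rotation/scale, row 3 holds translation (row-vector convention).
struct Mat43 { float m[4][3]; };

// Transforms a point: v * M.
Vec3 transform(const Vec3& v, const Mat43& M)
{
    return { v.x * M.m[0][0] + v.y * M.m[1][0] + v.z * M.m[2][0] + M.m[3][0],
             v.x * M.m[0][1] + v.y * M.m[1][1] + v.z * M.m[2][1] + M.m[3][1],
             v.x * M.m[0][2] + v.y * M.m[1][2] + v.z * M.m[2][2] + M.m[3][2] };
}

struct Influence { int vertIdx; float weight; };

struct Bone
{
    Mat43 invOrientation;               // inverse bone orientation matrix
    Mat43 animation;                    // current bone animation matrix
    std::vector<Influence> influences;  // vertices this bone affects
};

// Accumulates weight * (vertex * invOrientation * animation) per vertex.
std::vector<Vec3> skin(const std::vector<Vec3>& verts, const std::vector<Bone>& bones)
{
    std::vector<Vec3> out(verts.size(), Vec3{0.0f, 0.0f, 0.0f});  // temporary buffer
    for (const Bone& bone : bones)
    {
        for (const Influence& inf : bone.influences)
        {
            assert(inf.vertIdx >= 0 &&
                   static_cast<std::size_t>(inf.vertIdx) < verts.size());
            const Vec3 local = transform(verts[inf.vertIdx], bone.invOrientation);
            const Vec3 moved = transform(local, bone.animation);
            out[inf.vertIdx].x += moved.x * inf.weight;
            out[inf.vertIdx].y += moved.y * inf.weight;
            out[inf.vertIdx].z += moved.z * inf.weight;
        }
    }
    return out;
}
```

The weights on each vertex are expected to sum to 1; a vertex with no influences stays at the origin in this sketch, so a real renderer would want to handle that case explicitly.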

Listing 1.21.1: Exporting the Mesh to a File


bool ExportMesh(INode* pNode, FILE *out)
{
    MRCmesh_hdr mHdr;
    Matrix3 tm = pNode->GetObjectTM(0);
    ObjectState os = pNode->EvalWorldState(0);
    int needDelete;
    int i;
    Mesh& mesh = *(((GeomObject*) os.obj)->GetRenderMesh(0, pNode, ...));

    // write the mesh vertices
    mHdr.vertCnt = mesh.getNumVerts();
    for(i = 0; i < mHdr.vertCnt; i++)
    {
        Point3 pnt = mesh.getVert(i) * tm;   // premultiply in MAX
        // ...
    }

    // write vertex normals
    mesh.buildNormals();
    mHdr.normCnt = mesh.getNumVerts();
    for(i = 0; i < mHdr.normCnt; i++)
    {
        Point3 norm = Normalize(mesh.getNormal(i));
        // ...
    }

    // build and write faces
    mHdr.faceCnt = mesh.getNumFaces();
    for(i = 0; i < mHdr.faceCnt; i++)
    {
        MRCface_hdr fHdr;
        fHdr.vert[0] = mesh.faces[i].v[0];
        fHdr.vert[1] = mesh.faces[i].v[1];
        fHdr.vert[2] = mesh.faces[i].v[2];
        // ...
    }
    // ...
}

Listing 1.21.2: Reading Bone Assignments


bool GetPhysiqueWeights(INode *pNode, INode *pRoot, Modifier *pMod, BoneData_t *BD)
{
    // create a Physique Export Interface for given Physique Modifier
    IPhysiqueExport *phyInterface =
        (IPhysiqueExport*) pMod->GetInterface(I_PHYINTERFACE);
    if (phyInterface)
    {
        // create a ModContext Export Interface for the specific
        // node of the Physique Modifier
        IPhyContextExport *modContextInt =
            (IPhyContextExport*) phyInterface->GetContextInterface(pNode);
        if (modContextInt)
        {
            // needed by vertex interface (only Rigid supported by now)
            modContextInt->ConvertToRigid(TRUE);
            // more than a single bone per vertex
            modContextInt->AllowBlending(TRUE);

            int totalVtx = modContextInt->GetNumberVertices();
            for(int i = 0; i < totalVtx; i++)
            {
                IPhyVertexExport *vtxInterface =
                    (IPhyVertexExport*) modContextInt->GetVertexInterface(i);
                if (vtxInterface)
                {
                    int vtxType = vtxInterface->GetVertexType();
                    if(vtxType == RIGID_TYPE)
                    {
                        INode *boneNode =
                            ((IPhyRigidVertex*)vtxInterface)->GetNode();
                        int boneIdx = GetBoneIndex(pRoot, boneNode);
                        // Build vertex data
                        MRCweight_hdr wdata;
                        wdata.vertIdx = i;
                        wdata.weight = 1.0f;
                        // Insert into proper bonedata
                        BD[boneIdx].weightsVect.push_back(wdata);
                        // update vertexWeightCnt for that bone
                        BD[boneIdx].Hdr.vertexCnt = BD[boneIdx].weightsVect.size();
                    }
                    else if(vtxType == RIGID_BLENDED_TYPE)
                    {
                        IPhyBlendedRigidVertex *vtxBlendedInt =
                            (IPhyBlendedRigidVertex*)vtxInterface;
                        for(int j = 0; j < vtxBlendedInt->GetNumberNodes(); j++)
                        {
                            INode *boneNode = vtxBlendedInt->GetNode(j);
                            int boneIdx = GetBoneIndex(pRoot, boneNode);
                            // Build vertex data
                            MRCweight_hdr wdata;
                            wdata.vertIdx = i;
                            wdata.weight = vtxBlendedInt->GetWeight(j);
                            // check vertex existence for this bone
                            bool notfound = true;
                            for (int v = 0; notfound && v < BD[boneIdx].weightsVect.size(); v++)
                            {
                                // update found vert weight data for that bone
                                if ( BD[boneIdx].weightsVect[v].vertIdx == wdata.vertIdx )
                                {
                                    BD[boneIdx].weightsVect[v].weight += wdata.weight;
                                    notfound = false;
                                }
                            }
                            if (notfound)
                            {
                                // Add a new vertex weight data into proper bonedata
                                BD[boneIdx].weightsVect.push_back(wdata);
                                // update vertexWeightCnt for that bone
                                BD[boneIdx].Hdr.vertexCnt = BD[boneIdx].weightsVect.size();
                            }
                        }
                    }
                }
            }
            phyInterface->ReleaseContextInterface(modContextInt);
        }
        pMod->ReleaseInterface(I_PHYINTERFACE, phyInterface);
    }
    return true;
}


References
SDK documentation file:
[Discreet00] Max SDK Plug-in development documentation: SDK.HLP
Web links:
[Tombesi01] Tombesi, Marco's Web page: http://digilander.iol.it/baggior/
Books:
[Woodland00] Woodland, Ryan, "Filling the Gaps: Advanced Animation Using Stitching and Skinning," Game Programming Gems. Charles River Media 2000; pp. 476-483.
[Shankel00] Shankel, Jason, "Matrix-Quaternion Conversions" and "Interpolating Quaternions," Game Programming Gems. Charles River Media 2000; pp. 200-213.

1.22
Using Web Cameras in Video Games
Nathan d'Obrenan, Firetoad Software

Most games nowadays have multiplayer capabilities; however, the only interaction that goes on among online gamers is the occasional text message. Imagine having the ability to see the expression on your opponent's face when you just pass them before reaching the finish line, or when they get fragged by your perfectly placed rocket. Web cams allow you that functionality, and with high-speed Internet slowly becoming standard, it's becoming feasible to send more data to more clients. This gem demonstrates a straightforward approach to implementing Web cam methodologies into a game. We'll be using Video for Windows to capture the Web cam data, so Windows is required for the Web cam initialization function. We will cover numerous approaches for fast image culling, motion detection, and a couple of image manipulation routines. By the end, we will have a fully functional Web cam application that can be run and interacted with at reasonable frame rates.

Initializing the Web Cam Capture Window


The following code demonstrates how to use Video for Windows to set up a Web camera window in an application. Note to the reader: when dealing with video drivers from hardware vendors, you can never have too much error checking and handling code (review the source code on the CD for a more thorough implementation).
// Globals
HWND hWndCam = NULL;
BOOL cam_driver_on = FALSE;
int wco_cam_width = 160, wco_cam_height = 120;
int wco_cam_updates = 400, wco_cam_threshold = 120;

// WEBCAM_INIT
void webcam_init(HWND hWnd)
{
    // Set the window to be a pixel by a pixel large
    hWndCam = capCreateCaptureWindow(appname,
        WS_CHILD | WS_VISIBLE | WS_CLIPCHILDREN | WS_CLIPSIBLINGS,
        0, 0, 1, 1, hWnd, 0);

    if(hWndCam)
    {
        // Connect the cam to the driver
        cam_driver_on = capDriverConnect(hWndCam, 1);

        // Get the capabilities of the capture driver
        if(cam_driver_on)
        {
            capDriverGetCaps(hWndCam, &caps, sizeof(caps));

            // Set the video stream callback function
            capSetCallbackOnFrame(hWndCam, webcam_callback);

            // Set the preview rate in milliseconds
            capPreviewRate(hWndCam, wco_cam_updates);

            // Disable preview mode
            capPreview(hWndCam, FALSE);

            // Initialize the bitmap info to the way we want
            capwnd.bmiHeader.biSize = sizeof(BITMAPINFOHEADER);
            capwnd.bmiHeader.biWidth = wco_cam_width;
            capwnd.bmiHeader.biHeight = wco_cam_height;
            capwnd.bmiHeader.biPlanes = 1;
            capwnd.bmiHeader.biBitCount = 24;
            capwnd.bmiHeader.biCompression = BI_RGB;
            capwnd.bmiHeader.biSizeImage = wco_cam_width*wco_cam_height*3;
            capwnd.bmiHeader.biXPelsPerMeter = 100;
            capwnd.bmiHeader.biYPelsPerMeter = 100;

            if(capSetVideoFormat(hWndCam, &capwnd, sizeof(BITMAPINFO)) == FALSE)
            {
                // The driver cannot deliver the format we asked for
                capSetCallbackOnFrame(hWndCam, NULL);
                DestroyWindow(hWndCam);
                hWndCam = NULL;
                cam_driver_on = FALSE;
            }
            else
            {
                // Assign memory and variables
                webcam_set_vars();

                // Color (BGR) texture
                glGenTextures(1, &webcam_tex.gl_bgr);
                glBindTexture(GL_TEXTURE_2D, webcam_tex.gl_bgr);
                glTexImage2D(GL_TEXTURE_2D, 0, 3, webcam_tex.size, webcam_tex.size,
                             0, GL_BGR_EXT, GL_UNSIGNED_BYTE, webcam_tex.bgr);
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
                glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
                glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

                // Greyscale (luminance) texture
                glGenTextures(1, &webcam_tex.gl_grey);
                glBindTexture(GL_TEXTURE_2D, webcam_tex.gl_grey);
                glTexImage2D(GL_TEXTURE_2D, 0, 1, webcam_tex.size, webcam_tex.size,
                             0, GL_LUMINANCE, GL_UNSIGNED_BYTE, webcam_tex.greyscale);
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
                glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
                glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
            }
        }
    }
    else
    {
        cam_driver_on = FALSE;
    }
}

The above function retrieves the handle to the Web cam window we're capturing from through the function capCreateCaptureWindow(). We initialize it with window properties such as its size and whether it should be visible. In our case, we do want it to be visible; however, we only make the window 1x1 pixel, so it's basically invisible. This is required because we don't actually want to display the image subwindow, but we do want to receive the data updates from Windows through the callback function. We then retrieve driver information, set the callback function (more on this later), set the interval in milliseconds at which we want to refresh the Web cam, and reset all our variables. The driver is then tested to see if it can return the standard bitmap information in which we are interested. Upon success, we initialize all the memory for our movement buffers, as well as the OpenGL textures. We pull a little trick when deciding how big to make these textures, which will come in handy later on. Based on whatever height we set up our Web cam window to be, we find the next highest power of 2 and allocate our memory to that size. Even though we are allocating a bigger buffer than the Web cam image, we save ourselves an expensive texture resize operation by just doing a memcpy() right into the larger buffer, at the cost of some small precision loss in the Web cam image.
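The power-of-2 sizing described above can be sketched as follows. This is only an illustration: the helper names next_pow2 and texture_size_for are hypothetical and do not come from the gem's source, and the sketch sizes the texture from the larger of the two image dimensions so that a whole row always fits, which is an assumption rather than the author's stated rule.

```c
/* Smallest power of two that is >= n (returns 1 for n == 0).
   Assumes n <= 2^31 so the shift cannot overflow. */
static unsigned int next_pow2(unsigned int n)
{
    unsigned int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Edge length of a square texture able to hold a w x h image
   (hypothetical helper; uses the larger dimension). */
static unsigned int texture_size_for(unsigned int w, unsigned int h)
{
    return next_pow2(w > h ? w : h);
}
```

For the default 160x120 capture size used in this gem, such a helper would pick a 256x256 texture.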

Retrieving Data

Once we have our video window initialized, we need a way to retrieve the data from the Web cam every frame. To let Windows know which callback function it should send the data to, we must call capSetCallbackOnFrame() with the address of the callback function. When Windows decides it's time to update the Web cam, it will pass us the bitmap information inside the VIDEOHDR structure. In our case, we'll make the callback function process all the Web cam data to decide if we want to create a texture out of it. We pass all of that data to the webcam_calc_movement() function for further processing, which determines whether enough data has changed since the last frame; if it has, we update the texture.
// WEBCAM_CALLBACK
// Process video callbacks here
LRESULT WINAPI webcam_callback(HWND hwnd, LPVIDEOHDR video_hdr)
{
    // Calculate movement based off of threshold
    if(webcam_calc_movement(video_hdr, webcam_tex.delta_buffer,
                            wco_cam_width, wco_cam_height,
                            webcam_tex.size, wco_cam_threshold))
    {
        webcam_make_texture(video_hdr, wco_cam_rendering);
    }
    return TRUE;
}

Windows defines the VIDEOHDR structure as:

typedef struct videohdr_tag {
    LPBYTE lpData;           // pointer to locked data buffer
    DWORD  dwBufferLength;   // Length of data buffer
    DWORD  dwBytesUsed;      // Bytes actually used
    DWORD  dwTimeCaptured;   // Milliseconds from start of stream
    DWORD  dwUser;           // for client's use
    DWORD  dwFlags;          // assorted flags (see defines)
    DWORD  dwReserved[4];    // reserved for driver
} VIDEOHDR, NEAR *PVIDEOHDR, FAR * LPVIDEOHDR;

Windows saves the Web cam data in the buffer called lpData. This is the primary variable we are interested in, but dwTimeCaptured and some of the flags may prove useful as well. Now that we've captured the data from the Web cam, let's test it to see if it's useful.

Motion Detection

We now want to weed out any frames that have barely changed, so we can avoid unnecessary updates to our texture. Updating textures is a notoriously slow operation in a 3D API such as OpenGL. The following source code computes a delta buffer against the previous frame, and returns TRUE if the given threshold has been exceeded and FALSE otherwise. Note that returning early when the threshold has been exceeded could optimize this function further; however, that would prevent us from using the delta buffer later on. Ectosaver [FiretoadOO] uses these unsigned bytes of delta movement to calculate the amplitude of the waves it causes, and to determine when there is no one moving around.
// GLOBALS
unsigned char wco_cam_threshold = 128;   // This is a good amount (0-255)

// WEBCAM_CALC_MOVEMENT
// This is a simple motion detection routine that determines if
// you've moved further than the set threshold
BOOL webcam_calc_movement(LPVIDEOHDR video_hdr, unsigned char *delta_buff,
                          int webcam_width, int webcam_height,
                          int gl_size, unsigned char thresh)
{
    unsigned char max_delta = 0;
    int i = 0;
    int length = webcam_width * webcam_height;
    unsigned char *temp_delta;

    webcam_tex.which_buffer = webcam_tex.which_buffer ? 0 : 1;

    // Test for data before allocating, so this early return cannot leak
    if(!video_hdr->lpData)
        return FS_TRUE;

    temp_delta = (unsigned char *)malloc(sizeof(unsigned char) * length);

    for(i=0; i<length; i++)
    {
        // Save the current frame's data for comparison on the next frame
        // NOTE: We're only comparing the first byte of each pixel, which
        // is the blue channel because lpData is BGR, so in theory if the
        // user was in a solid blue room, coated in blue paint, we wouldn't
        // detect any movement... chances are this isn't the case :)
        // For our purposes, this test works fine
        webcam_tex.back_buffer[webcam_tex.which_buffer][i] =
            video_hdr->lpData[i*3];

        // Compute the delta buffer from the last frame
        // If it's the first frame, it shouldn't blow up given that we
        // cleared it to zero upon initialization
        temp_delta[i] =
            abs(webcam_tex.back_buffer[webcam_tex.which_buffer][i] -
                webcam_tex.back_buffer[!webcam_tex.which_buffer][i]);

        // Is the difference here greater than our threshold?
        if(temp_delta[i] > max_delta)
            max_delta = temp_delta[i];
    }

    // Fit to be inside a power of 2 texture
    for(i=0; i<webcam_height; i++)
    {
        memcpy(&delta_buff[i*(gl_size)],
               &temp_delta[i*(webcam_width)],
               sizeof(unsigned char)*webcam_width);
    }

    free(temp_delta);

    if(max_delta > thresh)
        return TRUE;
    else
        return FALSE;
}
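The early-return variant mentioned in the text can be sketched as below, for callers that only need the yes/no answer and never read the delta buffer. This is a portable illustration rather than the gem's code: the function name motion_exceeds and its parameters are hypothetical, and it takes plain buffers instead of a VIDEOHDR so that it does not depend on Video for Windows.

```c
#include <stddef.h>

/* Returns 1 as soon as the first byte of any pixel differs from the
   previous frame by more than thresh, otherwise 0.
   pixels   : packed 3-bytes-per-pixel frame (as delivered in lpData)
   previous : one saved byte per pixel from the last frame */
static int motion_exceeds(const unsigned char *pixels,
                          const unsigned char *previous,
                          size_t pixel_count, unsigned char thresh)
{
    size_t i;
    for (i = 0; i < pixel_count; i++) {
        int d = (int)pixels[i * 3] - (int)previous[i];
        if (d < 0)
            d = -d;
        if (d > thresh)
            return 1;   /* stop scanning at the first large change */
    }
    return 0;
}
```

The trade-off is the one the text points out: the scan stops early on a busy frame, but no per-pixel delta values are left behind for later effects.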

Manipulating Web Cam Data

Get the BGR Pixels

Once we've performed all our testing and culling, we are ready to manipulate the data we were sent from Windows. For this, we will simply copy the pixels from the VIDEOHDR data struct (the native format Windows returns is BGR) into a buffer that we've allocated with power-of-2 dimensions. Note that this technique avoids resizing the texture data's pixels, as it simply copies the pixels straight over, preserving the pixel aspect ratio. The only drawback to this technique is that it will leave some empty space in our texture, so we're left with a bar of black pixels at the top of the image. We can eliminate that bar by manipulating texture coordinates (once mapped onto 3D geometry) or resizing the texture.
// WEBCAM_MAKE_BGR
void webcam_make_bgr(unsigned char *bgr_tex, unsigned char *vid_data,
                     int webcam_width, int webcam_height, int glsize)
{
    int i;

    // Copy one image row at a time into the wider texture row
    for(i=0; i<webcam_height; i++)
    {
        memcpy(&bgr_tex[i*(glsize*3)],
               &vid_data[i*(webcam_width*3)],
               sizeof(unsigned char)*webcam_width*3);
    }
}
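One way to hide the unused part of the padded texture with texture coordinates is to compute how much of the texture actually holds image data. The sketch below is an assumption about how that could be done; the tex_extent type and webcam_tex_extent function are hypothetical names, not part of the gem's source.

```c
/* Fraction of a padded square texture that holds image data. */
typedef struct {
    float u_max;   /* largest useful horizontal texture coordinate */
    float v_max;   /* largest useful vertical texture coordinate */
} tex_extent;

static tex_extent webcam_tex_extent(int image_w, int image_h, int tex_size)
{
    tex_extent e;
    e.u_max = (float)image_w / (float)tex_size;
    e.v_max = (float)image_h / (float)tex_size;
    return e;
}
```

Mapping a quad with coordinates running from (0, 0) to (u_max, v_max) instead of (0, 0) to (1, 1) would then show only the rows and columns that were copied in.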

Convert to Grayscale

Once we've captured the BGR data, we could convert it to grayscale. This would result in an image that is one-third the size of our regular textures, which would be practical for users who have slow Internet connections, but still want to transmit Web cam data. Here is a function that multiplies each RGB component in our color buffer by a scalar amount and sums the results, effectively reducing all three color channels to one:
// WEBCAM_MAKE_GREYSCALE
void webcam_make_greyscale(unsigned char *grey, unsigned char *color, int dim)
{
    int i, j;

    // Greyscale = RED * 0.3f + GREEN * 0.4f + BLUE * 0.3f
    // i walks the 3-byte color pixels, j walks the 1-byte grey pixels
    for(i=0, j=0; j<dim*dim; i+=3, j++)
    {
        grey[j] = (unsigned char)float_to_int(0.30f * color[i] +
                                              0.40f * color[i+1] +
                                              0.30f * color[i+2]);
    }
}
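As a variation on the function above, the conversion can be done entirely in integer math. The sketch below is not the gem's method: it swaps in the widely used ITU-R BT.601 luma weights (0.299 red, 0.587 green, 0.114 blue), scaled to integers that sum to 256, and the name bgr_to_luma is hypothetical.

```c
#include <stddef.h>

/* Convert packed BGR pixels to one luminance byte per pixel using
   integer weights 77/256, 150/256 and 29/256 (approximately
   0.299, 0.587 and 0.114). */
static void bgr_to_luma(unsigned char *grey, const unsigned char *bgr,
                        size_t pixel_count)
{
    size_t j;
    for (j = 0; j < pixel_count; j++) {
        unsigned int b = bgr[j * 3];
        unsigned int g = bgr[j * 3 + 1];
        unsigned int r = bgr[j * 3 + 2];
        grey[j] = (unsigned char)((77u * r + 150u * g + 29u * b) >> 8);
    }
}
```

Because the weights sum to 256, pure white stays at 255 and the shift replaces both the floating-point multiplies and the float-to-int conversion.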

Real-Life Cartoons

Once we've successfully converted all our data to grayscale, we can manipulate the data to draw the picture in a cartoon-like fashion. This method splits the image into five different levels and six different colors, coloring different ranges of pixel values with solid values. All we have to do is perform some simple comparisons and evaluate each pixel based on our heat intensity constants. The final result is compared against a lookup from either the grayscale buffer or our delta buffer. If we want to see the image every frame (single buffer), we will need to compare against the grayscale. To give different results, we'll assign random color intensities for each pixel based on our heat intensity constants.
// WEBCAM_INIT_CARTOON
void webcam_init_cartoon(cartoon_s *cartoon_tex)
{
    char i;

    for(i=0; i<3; i++)
    {
        // Pick random colors in our range
        cartoon_tex->bot_toll_col[i]  = rand()%255;
        cartoon_tex->min_toll_col[i]  = rand()%255;
        cartoon_tex->low_toll_col[i]  = rand()%255;
        cartoon_tex->med_toll_col[i]  = rand()%255;
        cartoon_tex->high_toll_col[i] = rand()%255;
        cartoon_tex->max_toll_col[i]  = rand()%255;
    }
}

#define MIN_CAM_HEAT   50
#define LOW_CAM_HEAT   75
#define MED_CAM_HEAT   100
#define HIGH_CAM_HEAT  125
#define MAX_CAM_HEAT   150

// WEBCAM_MAKE_CARTOON
void webcam_make_cartoon(unsigned char *cartoon, cartoon_s cartoon_tex,
                         unsigned char *data, int dim)
{
    int i, j, n;

    // i walks the 3-byte output pixels, j walks the 1-byte input pixels.
    // The else-if chain makes sure values that land exactly on a
    // threshold are still assigned a color.
    for(i=0, j=0; j<dim*dim; i+=3, j++)
    {
        if(data[j] < MIN_CAM_HEAT)
        {
            for(n=0; n<3; n++)
                cartoon[i+n] = cartoon_tex.bot_toll_col[n];
        }
        else if(data[j] < LOW_CAM_HEAT)
        {
            for(n=0; n<3; n++)
                cartoon[i+n] = cartoon_tex.min_toll_col[n];
        }
        else if(data[j] < MED_CAM_HEAT)
        {
            for(n=0; n<3; n++)
                cartoon[i+n] = cartoon_tex.low_toll_col[n];
        }
        else if(data[j] < HIGH_CAM_HEAT)
        {
            for(n=0; n<3; n++)
                cartoon[i+n] = cartoon_tex.med_toll_col[n];
        }
        else if(data[j] < MAX_CAM_HEAT)
        {
            for(n=0; n<3; n++)
                cartoon[i+n] = cartoon_tex.high_toll_col[n];
        }
        else
        {
            for(n=0; n<3; n++)
                cartoon[i+n] = cartoon_tex.max_toll_col[n];
        }
    }
}

Uploading the New Texture

Now, all that's left is uploading the texture to OpenGL. The first step is to get the color values from Video for Windows. Once we have the color values, we can convert them to grayscale, and then go on to our cartoon renderer. Once all the image manipulation is finished, we call glTexSubImage2D() to get it into the appropriate texture. It is then ready for use in a 3D application as a texture.

// WEBCAM_MAKE_TEXTURE
void webcam_make_texture(LPVIDEOHDR video, webcam_draw_mode mode)
{
    // Build the color first
    webcam_make_bgr(webcam_tex.bgr, video->lpData,
                    wco_cam_width, wco_cam_height, webcam_tex.size);

    if(mode == GREYSCALE || mode == CARTOON)
        webcam_make_greyscale(webcam_tex.greyscale, webcam_tex.bgr,
                              webcam_tex.size);

    // Note: Could also pass in the delta buffer instead of
    // the greyscale
    if(mode == CARTOON)
        webcam_make_cartoon(webcam_tex.bgr, webcam_tex.cartoon,
                            webcam_tex.greyscale, webcam_tex.size);

    // Upload the greyscale version to OpenGL
    if(mode == GREYSCALE)
    {
        glBindTexture(GL_TEXTURE_2D, webcam_tex.gl_grey);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
                        webcam_tex.size, webcam_tex.size,
                        GL_LUMINANCE, GL_UNSIGNED_BYTE, webcam_tex.greyscale);
    }
    // Upload the color version to OpenGL
    else
    {
        glBindTexture(GL_TEXTURE_2D, webcam_tex.gl_bgr);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
                        webcam_tex.size, webcam_tex.size,
                        GL_BGR_EXT, GL_UNSIGNED_BYTE, webcam_tex.bgr);
    }
}

Destroy the Web Cam Window

After we're done using the Web cam, we need to destroy the window and set our callback function to NULL, so Windows knows to stop sending messages to it. In addition, we must free up all the memory we previously allocated to our color, grayscale, and delta buffers.
// WEBCAM_DESTROY
void webcam_destroy(void)
{
    if(cam_driver_on)
    {
        // Stop the callbacks before tearing down the window
        capSetCallbackOnFrame(hWndCam, NULL);
        DestroyWindow(hWndCam);
        hWndCam = NULL;

        // Release the color, greyscale, delta, and back buffers
        if(webcam_tex.bgr)
            free(webcam_tex.bgr);
        if(webcam_tex.greyscale)
            free(webcam_tex.greyscale);
        if(webcam_tex.delta_buffer)
            free(webcam_tex.delta_buffer);
        if(webcam_tex.back_buffer[0])
            free(webcam_tex.back_buffer[0]);
        if(webcam_tex.back_buffer[1])
            free(webcam_tex.back_buffer[1]);
    }
}

Conclusion
Web cams have a lot of untapped potential that game developers may not realize. They have the ability to be used as input devices, as in the way a mouse is used, by tracking color objects and translating their rotations from 2D to 3D [Wu99] . It's even possible to replace your standard mouse using a Web cam, by performing data smoothing and color tracking algorithms on the input frames.

References
Microsoft Developer Network Library, http://msdn.microsoft.com/library/devprods/vs6/visualc/vcsample/vcsmpcaptest.htm.
[FiretoadOO] Firetoad Software, Inc., Ectosaver, 2000, www.firetoads.com.
[Wu99] Wu, Andrew, "Computer Vision REU 99," www.cs.ucf.edu/~vision/reu99/profile-awu.html.

2.1
Floating-Point Tricks: Improving Performance with IEEE Floating Point
Yossarian King, Electronic Arts Canada
[email protected]

Overview
Integers have fixed precision and fixed magnitude. Floating-point numbers have a "floating" decimal point and arbitrary magnitude. Historically, integers were fast and floats were slow, so most game programmers used integers and avoided floats. Integer math was cumbersome for general calculations, but the performance benefits were worth the effort. Hardware costs have come down, however, and today's PCs and game consoles can do floating-point add, subtract, multiply, and divide in a few cycles. Game programmers can now take advantage of the ease of use of floating-point math.

Although basic floating-point arithmetic has become fast, complex functions are still slow. Floating-point libraries may be optimized, but they are generally implemented for accuracy, not performance. For games, performance is often more important than accuracy. This gem presents various tricks to improve floating-point performance, trading accuracy for execution speed. Table lookup has long been a standard trick for integer math; this gem shows generalized linear and logarithmic lookup table techniques for optimizing arbitrary floating-point functions. The following sections discuss:

• The IEEE floating-point standard
• Tricks for fast float/int conversions, comparisons, and clamping
• A linear lookup table method to optimize sine and cosine
• A logarithmic method to optimize square root
• Generalized lookup table methods to optimize arbitrary floating-point functions
• The importance of performance measurement


Section 2

Mathematics

IEEE Floating-Point Format


The IEEE standard for floating-point numbers dictates a binary representation and conventions for rounding, accuracy, and exception results (such as divide by zero). The techniques outlined in this article rely on the binary representation, but are generally not concerned with the rounding and exception handling. If a computer or game console uses the standard binary representation, then these tricks apply, regardless of whether the floating-point handling is fully IEEE compliant. The Pentium III Streaming SIMD Extensions (SSE) and PS2 vector unit both implement subsets of the IEEE standard that do not support the full range of exception handling; however, since the binary representation follows the standard, the tricks in this gem will work with these instruction sets. The IEEE standard represents floating-point numbers with a sign bit, a biased exponent, and a normalized mantissa, or significand. Single precision, 32-bit floating-point numbers (a "float" in C) are stored as shown in Figure 2.1.1.

[Figure 2.1.1: bit layout of a 32-bit float. Bit 31 = sign s; bits 30-23 = biased exponent e; bits 22-0 = normalized mantissa m. The floating-point number is s × 1.m × 2^(e-127).]

FIGURE 2.1.1 IEEE 32-bit floating-point format has a 1-bit sign, 8-bit exponent, and 23-bit mantissa.

The exponent is stored as a positive number in biased form, with 127 added to the actual exponent (rather than the more familiar two's complement representation used for integers). The mantissa is usually stored in normalized form, with an implied 1 before the 23-bit fraction. Normalizing in this way allows maximum precision to be obtained from the available bits. A floating-point number thus consists of a normalized significand representing a number between 1 and 2, together with a biased exponent indicating the position of the binary point and a sign bit. The number represented is therefore:

$n = s \times 1.m \times 2^{(e-127)}$

For example, the number -6.25 in binary is -110.01, or $-1 \times 1.1001 \times 2^{2}$. This would be represented with s = 1, e = 2 + 127 = 10000001, m = [1.]1001, as shown in Figure 2.1.2. Some additional "magic values" are represented using the exponent. When e = 255, m encodes special conditions such as not-a-number (NaN), undefined result, or positive or negative infinity. Exponent e = 0 is used for denormalized numbers, that is, numbers so tiny that the range of the exponent overflows 8 bits.

[Figure 2.1.2: -6.25 decimal = -110.01 binary = -1 × [1.]1001 × 2^2, stored as s = 1, e = 2 + 127 = 10000001, m = 1001000...]

FIGURE 2.1.2 The number -6.25 as stored in memory in 32-bit IEEE floating-point format.
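The field layout described above can be checked directly by pulling a float apart. The following is a minimal sketch, assuming a platform where float is a 32-bit IEEE single; the names float_fields and decode_float are hypothetical and used only for illustration.

```c
#include <stdint.h>

/* The three fields of a 32-bit IEEE float. */
typedef struct {
    uint32_t sign;       /* 1 bit: 1 means negative */
    uint32_t exponent;   /* 8 bits: biased by 127 */
    uint32_t mantissa;   /* 23 bits: fraction after the implied 1 */
} float_fields;

static float_fields decode_float(float f)
{
    union { uint32_t u; float f; } bits;   /* view the same 32 bits two ways */
    float_fields out;

    bits.f = f;
    out.sign     = bits.u >> 31;
    out.exponent = (bits.u >> 23) & 0xFFu;
    out.mantissa = bits.u & 0x7FFFFFu;
    return out;
}
```

Decoding -6.25f this way gives sign 1, exponent 129 (binary 10000001), and mantissa 0x480000 (binary 1001 followed by zeros), matching Figure 2.1.2.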

Double precision 64-bit floating-point numbers are stored using the same basic format, but with 11 bits for the exponent and 52 for the significand. The exponent is biased by 1023, rather than 127. Double precision numbers require twice the storage space and may be slower to load from memory and to process. For these reasons, double precision should generally be avoided in game code. This gem uses only single-precision floats.

Floating-Point Tricks
Before getting to the lookup table techniques, this section discusses some useful floating-point tricks that help explain the games you can play with the bit patterns of floating-point numbers.

Float/Int Conversions

The lookup table techniques that follow convert a floating-point number to an integer to generate lookup table indices. This operation can be slow; on a Pentium II, for example, casting a float to an int with "(int)f" takes about 60 cycles. This is because the ANSI C standard dictates that casting a float to an int should truncate the fraction, but by default, the FPU rounds to the nearest integer. Casting to an int becomes a function call to a routine that changes the FPU rounding mode, does the conversion, and then changes the rounding mode back. Nasty. Note that the cost of casting between ints and floats is dependent on the compiler and processor with which you are working. As with all optimizations, benchmark this conversion trick against a regular typecast and disassemble the code to see what's actually happening.

The conversion can be performed much faster by simply adding 1 × 2^23 to the floating-point number and then discarding the upper exponent bits of the result. We'll look at the code first, and then analyze why it works. To do this, it is helpful to define the following union, which lets us access a 32-bit number as either an integer or a float.

typedef union
{
    int i;
    float f;
} INTORFLOAT;

The INTORFLOAT type is used in code snippets throughout this gem. Note that it makes access to the bit pattern of numbers look very simple; in practice, the compiler may be generating more code than you expect. On a Pentium II, for example, floating-point and integer registers are in separate hardware, and data cannot be moved from one to the other without going through memory; for this reason, accessing the members of the INTORFLOAT union may require additional memory loads and stores. Here is how to convert a float to an int:

INTORFLOAT n;      // floating-point number to convert
INTORFLOAT bias;   // "magic" number

bias.i = (23 + 127) << 23;   // bias constant = 1 x 2^23
n.f = 123.456f;              // some floating-point number
n.f += bias.f;               // add as floating-point
n.i -= bias.i;               // subtract as integer
// n.i is now 123 - the integer portion of the original n.f

Why does this work? Adding 1 × 2^23 as a floating-point number pushes the mantissa into the lower 23 bits, setting the exponent to a known value (23 + 127). Subtracting the known exponent as an integer removes these unwanted upper bits, leaving the desired integer in the low bits of the result. These steps are illustrated in Figure 2.1.3 for the number 43.25.

[Figure 2.1.3: 43.25 (101011.01 binary) plus 1 × 2^23 gives [1.]00000000000000000101011 01 × 2^23; subtracting the bias as an integer leaves 101011 = 43.]

FIGURE 2.1.3 The number 43.25 is converted to an integer by manipulating the floating-point format. The underlined bits in the mantissa do not fit in memory and are discarded (with rounding).

On a Pentium II (with everything in cache), this reduces the conversion time from 60 cycles to about 5. Note that it is also possible to write inline assembly code to get the FPU to convert from float to int without changing the rounding mode; this is faster than typecasting, but generally slower than the biasing trick shown here.

This trick works as long as the floating-point number to be converted does not "overlap" the bias constant being added. As long as the number is less than 2^23, the trick will work. To handle negative numbers correctly, use bias = ((23 + 127) << 23) + (1 << 22). The additional (1 << 22) makes this equivalent to adding 1.5 × 2^23, which causes correct rounding for negative numbers, as shown in Figure 2.1.4. The extra bit is required so that the subtract-with-borrow operation does not affect the most significant bit in the mantissa (bit 23). In this case, 10 upper bits will be removed instead of 9, so the range is one bit less than for positive numbers: the number to be converted must be less than 2^22.

To convert from a float to a fixed-point format with a desired number of fractional bits after the binary point, use bias = (23 - bits + 127) << 23. Again, to handle negative numbers, add an additional (1 << 22) to bias. This is illustrated in Figure 2.1.5, which shows the conversion of 192.8125 to a fixed-point number with two fractional bits. Note that you can use the "inverse" of this trick to convert from integer to floating-point.
[Figure 2.1.4: -43.25 (-101011.01 binary) is added to 1.5 × 2^23; after the bias is subtracted as an integer, the low bits hold ...11010100 = -44.]

FIGURE 2.1.4 Converting a negative float to an integer is slightly different than for positive numbers. Here we see the conversion of -43.25. Observe how the rounding applied when the underlined bits are discarded yields the correct negative integer.

[Figure 2.1.5: 192.8125 (11000000.1101 binary) is added to 1 × 2^(23-2); after the bias is subtracted as an integer, the result is 11000000.11 = 192.75 in 21.2 fixed-point.]

FIGURE 2.1.5 Fractional bits can be preserved during the conversion from float to integer. Here, 192.8125 is converted to a fixed-point number with two bits after the binary point.

n.i = 123;        // some integer
n.i += bias.i;    // add as integer
n.f -= bias.f;    // subtract as floating-point
// n.f is now 123.0 - the original n.i converted to a float

Usually, int-to-float conversions using typecasts are fast, and thus less in need of a performance-optimizing trick.

Sign Test

Because the sign bit of a floating-point number is in bit 31, the same as for integers, we can use the integer unit to test for positive or negative floating-point numbers. Given a floating-point number f, the following two code fragments are (almost) equivalent:

if ( f < 0.0f )      // floating-point compare

INTORFLOAT ftmp;
ftmp.f = f;
if (ftmp.i < 0)      // integer compare

Although they are equivalent, the integer compare may run faster due to better pipelining of the integer instruction stream. Try it and see if it helps your code. ("Almost" equivalent because negative 0 will behave differently.)

Comparisons

Since the floating-point format stores sign, exponent, mantissa in that bit order, we can use the integer unit to compare floating-point numbers: if the exponent of a is greater than the exponent of b, then a is greater than b, no matter what the mantissas. The following code fragments may be equivalent:

if ( a < b )              // floating-point compare

INTORFLOAT atmp, btmp;
atmp.f = a;
btmp.f = b;
if (atmp.i < btmp.i)      // integer compare

Again, the integer comparison will usually pipeline better and run faster. Note that this breaks down when a and b are both negative, because the exponent and mantissa bits are not stored in the two's complement form that the integer comparison expects. If your code can rely on at least one of the numbers being positive, then this is a faster way to do comparisons.

Clamping

Clamping a value to a specific range often comes up in games programming, and often we want to clamp to a [0,1] range. A floating-point value f can be clamped to 0 (i.e., set f = 0 if f < 0) by turning the sign bit into a mask, as in the following code snippet:

INTORFLOAT ftmp;
ftmp.f = f;
int s = ftmp.i >> 31;    // create sign bit mask
s = ~s;                  // flip bits in mask
ftmp.i &= s;             // ftmp = ftmp & mask
f = ftmp.f;

s is set to the bits of f shifted right by 31; sign extension replicates the sign bit throughout all 32 bits. NOT-ing this value creates a mask of 0 bits if f was negative, or 1 bits if f was positive. AND-ing f with this value either leaves f unchanged or sets f to 0. Net result: if f was negative, then it becomes 0; if it was positive, it is unchanged. This code runs entirely in the integer unit, and has no compares or branches. In test code, the floating-point compare and clamp took about 18 cycles, while the integer clamp took less than five cycles. (Note that these cycle times include loop overhead.)

Clamping positive numbers to 0 (set f = 0 if f > 0) is less useful but even easier, since we don't need to flip the bits in the mask.

INTORFLOAT ftmp;
ftmp.f = f;
int s = ftmp.i >> 31;    // create sign bit mask
ftmp.i &= s;             // ftmp = ftmp & mask
f = ftmp.f;

Clamping to 1 (set f = 1 if f > 1) can be done by subtracting 1, clamping to 0, and then adding 1.

INTORFLOAT ftmp;
ftmp.f = f - 1.0f;
int s = ftmp.i >> 31;    // create sign bit mask
ftmp.i &= s;             // ftmp = ftmp & mask
f = ftmp.f + 1.0f;

Note that using conditional load instructions in assembly will generally increase the speed of clamping operations, as these avoid the need for branching, which kills the branch prediction logic in the instruction pipeline.
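Putting the two clamps together gives a branch-free clamp to [0,1]. The sketch below is an illustration with hypothetical function names; it assumes that right-shifting a negative int32_t is an arithmetic shift (which replicates the sign bit), something the C standard leaves implementation-defined but which holds on the common compilers this gem targets.

```c
#include <stdint.h>

typedef union {
    int32_t i;
    float   f;
} INTORFLOAT;

/* Set f = 0 if f < 0, otherwise leave it unchanged. */
static float clamp_below_zero(float f)
{
    INTORFLOAT t;
    t.f = f;
    t.i &= ~(t.i >> 31);   /* mask is all 0s for negatives, all 1s otherwise */
    return t.f;
}

/* Set f = 1 if f > 1, otherwise leave it unchanged. */
static float clamp_above_one(float f)
{
    INTORFLOAT t;
    t.f = f - 1.0f;
    t.i &= (t.i >> 31);    /* keep only negative values of (f - 1) */
    return t.f + 1.0f;
}

/* Clamp to the [0,1] range using both steps. */
static float clamp01(float f)
{
    return clamp_above_one(clamp_below_zero(f));
}
```

As with the other tricks here, whether this beats a pair of compares depends on the compiler and processor, so it is worth measuring before adopting it.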
Absolute Value

This one's easy: since floating-point numbers do not use two's complement, taking the absolute value of a floating-point number is as simple as masking the sign bit to 0.

INTORFLOAT ftmp;
ftmp.f = f;
ftmp.i &= 0x7fffffff;
f = ftmp.f;

Note that this is much faster than using a compare to determine if f is less than 0 before negating it.
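Wrapped as a function, the sign-bit mask looks like this. It is a minimal sketch; the name fast_fabs is hypothetical and int32_t stands in for the gem's plain int.

```c
#include <stdint.h>

typedef union {
    int32_t i;
    float   f;
} INTORFLOAT;

/* Absolute value by clearing bit 31, the sign bit. */
static float fast_fabs(float f)
{
    INTORFLOAT t;
    t.f = f;
    t.i &= 0x7fffffff;
    return t.f;
}
```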

Linear Lookup Tables for Sine and Cosine


Trigonometry is often useful in games, for example when calculating distances and angles, stepping along a circle, or animating a water mesh. The standard math library has all the normal trig functions, but they are slow, and they work on doubles, so they use more memory than needed. In a game, a low-precision calculation is often sufficient. To efficiently compute sine and cosine, we can use a lookup table. A common approach is to use fixed-point math, with angles represented on an integer scale, say, 0 to 1023 to cover the full circle. However, this means that the game programmer needs to understand the library implementation of sine and cosine, and represent his or her angles in the format it requires. By using floating-point tricks for efficient indexing, we can create floating-point trig functions that use standard radians and do not require the programmer to know about implementation details.
sin

Let's implement:
float fsin( float theta );

This can easily be done with a lookup table. A 256-entry table, covering the range of angles 0 to 2π, is initialized as:

sintable[i] = (float)sin((double)i * 2.0*3.14159265/256.0);

which simply converts i in the range 0-255 to floating-point radians in the range 0 to 2π and takes the sine of the resulting angle. Given this table, the fsin function could be implemented as follows:

float fsin( float theta )
{
    unsigned int i = (unsigned int)(theta * 256.0f/(2.0f*3.14159265f));
    return sintable[i];
}

However, this has two problems: first, it uses the slow float-to-int typecast, and second, if theta is outside the range [0,2π), then the function will index out of the table. Both of these problems are solved with this implementation:

#define FTOIBIAS 12582912.0f    // 1.5 * 2^23
#define PI       3.14159265f

float fsin( float theta )
{
    int i;
    INTORFLOAT ftmp;

    ftmp.f = theta * (256.0f/(2.0f*PI)) + FTOIBIAS;
    i = ftmp.i & 255;
    return sintable[i];
}

This implementation uses the floating-point biasing trick described previously for fast conversion from floating-point to integer. It masks the integer with 255 so that the table index wraps around, always staying in the 0-255 range. Note that if the scaled angle exceeds 2^22, then the float-to-integer conversion trick will fail, so it's still necessary to periodically reduce theta to the valid [0,2π) range.

This implementation of fsin takes about 10 cycles on a Pentium II (assuming all code and data is in primary cache), as compared with almost 140 cycles for the standard math library implementation of sin (even though sin uses the hardware sine instruction in the FPU). A 256-entry floating-point table takes 1K, which should easily stay within cache for the duration of your inner loops. Accuracy is basically eight bits, as constrained by the lookup table size. The worst-case error can easily be determined from analyzing the lookup table (as is demonstrated in the code on the CD). Larger lookup tables increase the accuracy of your results, but will hurt cache performance.


cos
The cosine function could be implemented in exactly the same way, with its own lookup table, but we can take advantage of the fact that cos(θ) = sin(θ + π/2), and use the same lookup table. To do this, we just need to add 256/4 = 64 (since adding π/2 means we're adding a quarter of a circle to the angle) to the lookup table index, which we can do at the same time as biasing the exponent. This yields the following implementation:

    float fcos( float theta )
    {
        int i;
        INTORFLOAT ftmp;

        ftmp.f = theta * (256.0f/(2.0f*PI)) + FTOIBIAS + 64.0f;
        i = ftmp.i & 255;
        return sintable[i];
    }

Depending on the application, it is often useful to get both sine and cosine at the same time. This can be done more efficiently than computing each separately: simply look up sin, and then add 64 to the index and mask by 255 to look up cos. If you need to compute several sines or cosines at once, you can write custom code to interleave the calculations and make it faster still.

Logarithmic Optimization of Square Root


Square roots are useful in games for operations such as computing distances, normalizing vectors, and solving quadratic equations. Despite the presence of a square root instruction built into the FPU, the sqrt function in the standard C library still takes about 80 cycles on a Pentium II CPU, making it another good candidate for optimization. Square root optimization is an interesting use of floating-point bit fiddling, because the logarithmic, multiscale nature of square root allows us to decompose the square root calculation and manipulate the mantissa and exponent separately. Consider the square root of a floating-point number:

$\sqrt{f} = \sqrt{1.m \times 2^{e}} = \sqrt{1.m} \times 2^{e/2}$

So, to compute the square root of f, we compute the square root of the mantissa and divide the exponent by 2. However, the exponent is an integer, so if the exponent is odd, then dividing by 2 loses the low bit. This is addressed by prepending the low bit of the exponent to the mantissa, so we have:

$\sqrt{f} = \sqrt{1.m \times 2^{e_0}} \times 2^{\lfloor e/2 \rfloor}$

where e0 is the low bit of the exponent.


This is implemented with a 256-entry table for the square root of the truncated mantissa and some additional tweaking for the exponent calculation, as follows:

    float fsqrt( float f )
    {
        INTORFLOAT ftmp;
        unsigned int n, e;

        ftmp.f = f;
        n = ftmp.i;
        e = (n >> 1) & 0x3f800000;   // divide exponent by 2
        n = (n >> 16) & 0xff;        // table index is e0+m22-m16
        ftmp.i = sqrttable[n] + e;   // combine results
        return ftmp.f;
    }

The table index is simply the upper bits of the mantissa and the low bit of the exponent (e0). The lookup table contains the mantissa of the computed square roots. The exponent of the square root is computed by shifting the exponent of f right by 1 to divide by 2. Since the exponent is biased, this divides the bias by 2 as well as the exponent, which is not what we want. This is compensated for by adding an additional factor to the entries of sqrttable to re-bias the exponent.

This fsqrt function takes about 16 cycles on a Pentium II CPU, about five times faster than the C library implementation. Again, this is assuming that everything is in cache. The algorithm is explained in more detail in the code on the CD.


Optimization of Arbitrary Functions


Consider an arbitrary floating-point function of one variable:

$y = f(x)$

The techniques just discussed reveal two basic methods for table-based optimizations of general functions. For sine and cosine, the value of x was linearly quantized over a known range and used as a table index to look up y. For square root, the value of x was logarithmically quantized and used as a table index to look up a value. This value was scaled by a function of the exponent of x to get the final value of y.

The linear approach rescales a floating-point number and converts it to an integer to generate a lookup table index via linear quantization. This is a simple technique very similar to integer lookup tables, the only wrinkle being the efficient conversion of a floating-point value into an integer index. The logarithmic approach uses the floating-point bit pattern directly as a table index, to achieve logarithmic quantization. Both of these techniques can be generalized to the case of arbitrary functions. Depending on the function, the linear or logarithmic approach may be more appropriate.


Linear Quantization

The fsin function in the previous section can be used as a template for optimizing general functions via linear quantization. Suppose we know that the function will only be used over a limited range x ∈ [A, B). We can build a lookup table that uniformly covers this range, and efficiently calculate the correct index into the table for values of x within the range. The optimized function f is then implemented as:

    #define FTOIBIAS   12582912.0f   // 1.5 * 2^23
    #define TABLESIZE  256
    #define INDEXSCALE ((float)TABLESIZE / (B - A))

    float flut( float x )
    {
        int i;
        INTORFLOAT ftmp;

        ftmp.f = x * INDEXSCALE + (FTOIBIAS - A * INDEXSCALE);
        i = ftmp.i & (TABLESIZE - 1);
        return ftable[i];
    }

The lookup table is initialized with:

    ftable[i] = f( (float)i / INDEXSCALE + A );

where f is the full-precision floating-point implementation of the function. The flut computation requires two floating-point operations (multiply and add), one integer bitwise mask, and a table lookup. It takes about 10 cycles on a Pentium II CPU. Note that additional accuracy can be obtained for a few more cycles by linearly interpolating the two closest table entries. An API supporting this optimization for general functions is provided on the CD, including optional linear interpolation to increase accuracy.

Logarithmic Quantization

The linear method treats the range [A, B) uniformly. Depending on the function, a logarithmic treatment may be more appropriate, as in the square root optimization. The basic idea is that the bits of the floating-point representation are used directly as a lookup table index, rather than being manipulated into an integer range. By extracting selected bits of the sign, exponent, and mantissa, we can massage the 1:8:23 IEEE floating-point number into our own reduced precision format with as many bits as we like for the sign, exponent, and mantissa.

In the square root example, we extracted 8 bits to give a logarithmically quantized 0:1:7 representation. We used 1 bit of the exponent and 7 bits of the mantissa. The sign bit was discarded, since the square root of a negative number is undefined. The 0:1:7 format represents an 8-bit mantissa (remember the implied 1 in the IEEE representation) and a 1-bit exponent, so it can represent numbers between [1].0000000 × 2^0 and [1].1111111 × 2^1, which covers the range [1, 4). The square root function was decomposed into an operation on the 0:1:7 quantized number (a table lookup) and an independent operation on the exponent (divide by 2). Additional trickery was employed to optimize the two independent operations and combine the mantissa and exponent into a 32-bit floating-point result.

Other functions can benefit from this method of logarithmic quantization. The IEEE format makes it easy to extract the least significant bits of the exponent with the most significant bits of the mantissa in a single shift and mask operation. To extract ebits of the exponent and mbits of the mantissa, simply do this:

    bits = (n >> (23 - mbits)) & ((1 << (ebits + mbits)) - 1);

This shifts the number n to the right so that the desired bits of the mantissa and exponent are the rightmost bits in the number, and then masks off the desired number of bits.

The sign bit can be handled with some extra bit fiddling, depending on the function with which you are working. If you know that you are only dealing with positive numbers (for example, square root), or that your function always returns a positive result, then you can ignore the sign. If the sign of your result is the same as the sign of the input number (in other words, f(-x) = -f(x)), you can simply save and restore the sign bit.

For functions with a limited range of input values, masking out selected bits of the exponent and mantissa can give you a direct table index. For example, if you only care about your function over the range [1, 16), then you can use 2 bits of exponent and 4 bits of mantissa. This 0:2:4 representation stores binary numbers between 1.0000 × 2^0 and 1.1111 × 2^3, or decimal 1.0 to 15.5. Mask out these bits and use the bits directly as an index into a precomputed 64-entry table.
This requires very few cycles and is computationally fast. However, as you add more precision, the table grows and may become prohibitively large, at which point cache performance will suffer. An alternative is to decompose the exponent and mantissa calculations, as was done in the square root example. If your function f(x) can be decomposed as:

$f(x) = f(1.m \times 2^{e}) = f_1(1.m) \times f_2(e)$

then you can, for example, approximate f1 with a 256-entry lookup table, using 8 bits of the mantissa m, and perform the calculation of f2 directly, as an integer operation on the exponent e. This is essentially the technique used by the square root trick.

Logarithmic quantization is a powerful tool, but often requires function-specific bit fiddling to optimize a particular function. Fully general techniques are not always possible, but the methods described in this section should be helpful when tackling your specific optimization problem.
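As an illustration of the direct-index case, the sketch below applies the 0:2:4 format over [1, 16) to the reciprocal 1/x. The reciprocal is only a stand-in function chosen for this example, and the names quantize, init_recip_table, and frecip_lut are hypothetical. The table is filled by running every representable 0:2:4 value through the same quantize step used at lookup time, so the order in which the two exponent bits wrap around does not matter.

```c
#include <stdint.h>

#define EBITS    2                       /* low bits of the exponent kept */
#define MBITS    4                       /* high bits of the mantissa kept */
#define LUT_SIZE (1 << (EBITS + MBITS))  /* 64 entries */

/* Assumed definition: a union that overlays a float and an unsigned 32-bit integer. */
typedef union { uint32_t i; float f; } INTORFLOAT;

static float reciptable[LUT_SIZE];

/* Shift-and-mask extraction of the 0:2:4 index, as in the text. */
static uint32_t quantize(float x)
{
    INTORFLOAT t;
    t.f = x;
    return (t.i >> (23 - MBITS)) & ((1u << (EBITS + MBITS)) - 1u);
}

/* Visit every 0:2:4 value in [1,16): biased exponents 127..130, 16 mantissas each. */
void init_recip_table(void)
{
    for (uint32_t e = 127; e <= 130; ++e) {
        for (uint32_t m = 0; m < (1u << MBITS); ++m) {
            INTORFLOAT t;
            t.i = (e << 23) | (m << (23 - MBITS));
            reciptable[quantize(t.f)] = 1.0f / t.f;
        }
    }
}

/* Approximate 1/x; only valid for x in [1, 16). */
float frecip_lut(float x)
{
    return reciptable[quantize(x)];
}
```

With only 4 mantissa bits the input is truncated by up to 1/16 of its value, so the result can be off by about 6 percent. Inputs outside [1, 16) silently alias onto entries for other exponents, so the caller is responsible for the range check.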


Performance Measurement
When optimizing code, make sure you measure performance carefully before and after making the optimization. Sometimes an optimization that looks good on paper causes trouble when implemented, due to cache behavior, branch mispredictions, or poor handling by the compiler. Whenever you make changes, be sure that you are improving your performance; never assume. Make sure compiler optimizations are enabled. Use inline functions where appropriate. Again, test your results when using inline functions or tweaking the compiler settings.

When benchmarking code, take care that compiler optimization isn't getting in the way of your tests. Disassemble your code and step through it to be sure it's running what you expected. When timing things, it's often helpful to run something in a loop, but if the body of your loop gets optimized out, then your timing won't be very accurate!

On Pentium computers, you can use the rdtsc (read time stamp counter) instruction to get the current CPU cycle count. Intel warns that this instruction should be executed a couple times before you start using the results. Intel also recommends preceding it with a serializing instruction such as cpuid, which forces all earlier instructions to complete, so that you get more consistent timing results. To get absolute times, the cycle counts can be converted to seconds by dividing by the clock frequency of the processor (for example, 450 million cycles per second for a 450-MHz CPU).

Cycle counters are the most reliable way to measure fine-grain performance. Other tools such as VTune and TrueTime (on the PC) are useful for higher level profiling. For any benchmarking, make sure that memory behavior is realistic, as memory bottlenecks are one of the most serious impediments to high performance on modern processors. Be aware of how your benchmark is using the cache, and try to simulate the cache behavior of your game. For a benchmark, the cache can be "warmed up" by running the algorithm a couple of times before taking the timing measurements. However, a warm cache may not emulate the behavior of your game; it is best to benchmark directly in the game itself. Disable interrupts for more reliable results, or take measurements multiple times and ignore the spikes.

All the cycle times reported in this gem are from an Intel Pentium II 450-MHz machine. Each operation was repeated 1000 times in a loop, with the instruction and data cache warmed by running the test multiple times. Cycle counts include loop overhead. See the code on the CD for actual benchmarks used.

The lookup table techniques described in this article are appropriate if the lookup table remains in cache. This is probably true within the inner loop of your rendering pipeline or physics engine, but it's probably not true if you are calling these functions randomly throughout the code. If the lookup tables cannot be kept in cache, then techniques that use more computation and fewer memory accesses, such as polynomial approximation, are probably more appropriate (see [Edwards00] for a good overview).


Conclusions
This gem scratched the surface of floating-point optimization. Lookup tables are the primary method explored, and they often produce significant speedups. However, be aware of cache behavior, and always benchmark your results. Sometimes you can achieve the same result faster by using more computation but touching memory less; techniques such as polynomial approximation may be appropriate. The tricks shown here can be extended in a variety of ways, and many other tricks are possible. As a popular book title suggests, there is a Zen to the art of code optimization, and a short overview like this can't hope to cover all possibilities.

References
[Abrash94] Abrash, Michael, Zen of Code Optimization, Coriolis Group, 1994.
[Edwards00] Edwards, Eddie, "Polynomial Approximations to Trigonometric Functions," Game Programming Gems, Charles River Media, 2000.
[Intel01] Intel Web page on floating-point unit and FPU data format. Good for PCs, but relevant to any IEEE-compliant architecture. Available at http://developer.intel.com/design/intarch/techinfo/Pentium/fpu.htm.
[Lalonde98] Lalonde, Paul, and Dawson, Robert, "A High Speed, Low Precision Square Root," Graphics Gems, Academic Press, 1998.

2.2
Vector and Plane Tricks
John Olsen, Microsoft

Your collision detection routine is running flawlessly now, returning a surface point and a normal back when you feed it a position and velocity vector. Now what? Actually, there are quite a few things you can do based on the data you have. In general, you may want to have your collision routine generate the actual collision point, but the methods in this gem show how to handle collision results that show only the plane of intersection. Since a plane can be fully specified with a surface normal and any point on the plane, you can work your way through the math to find everything you need from there.

Data that goes into your collision code would be an initial point Pi and a final point Pf, and the output in the case of a collision would be a plane that is defined by a unit vector surface normal N and a point on the surface Ps. The point need not be the actual intersection point as long as it is on the plane.

For optimization purposes, you will probably want to build a subset of these calculations back into your collision code. Much of the information you need will have already been calculated during the collision tests. It will be much faster to reuse the already known information rather than recalculate it from scratch.

The plane equation Ax + By + Cz + D = 0 maps onto the supplied data, where A, B, and C are the components of the normal vector N, and D is the negated dot product -(N · Ps), so that the point Ps satisfies the equation.

Altitude Relative to the Collision Plane


One of the most commonly used pieces of data when checking collisions is the altitude of one of your data points. If the altitude is positive, the point is above the surface and has not collided yet. If it is negative, you have collided and penetrated the surface. Typical collision testing code will only return a hit if one of your test points is on each side of the test surface. This means that if you want to predict collisions, you need to pass along a position with an exaggerated velocity vector. That way, the exaggerated vector will intersect much earlier than your actual movement would allow. Once you have tricked your collision code into returning a surface point and normal, you can get your altitude relative to that surface by using your initial position. The final position is not used for this altitude calculation.

FIGURE 2.2.1 Determining the altitude.

As shown in Figure 2.2.1, we want to find the length of the vector (Pi - Ps) when it is projected onto the surface normal N. This gives us the distance of the shortest line from the point to the surface. This shortest vector is by definition perpendicular to the surface. This is exactly what the dot product gives us, so we are left with the scalar (nonvector) distance to the surface Ds as shown:

$D_s = (P_i - P_s) \cdot N$

Nearest Point on the Surface

Once we have the distance to the surface, it takes just one more step to get the point on the surface Pn that is closest to the initial point Pi, as also shown in Figure 2.2.1. We already know the point is distance Ds from the starting point, and that distance is along the surface normal N. That means the point can be found with the following:

$P_n = P_i - D_s N$

The normal vector is facing the opposite direction of the distance we want to measure, so it needs to be subtracted from the starting point.
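The altitude and nearest-point calculations translate directly into code. The following is a minimal sketch; the Vec3 type and the helper and function names are assumptions made for this illustration, and N is assumed to be unit length.

```c
typedef struct { float x, y, z; } Vec3;

static Vec3  v_sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3  v_scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }
static float v_dot(Vec3 a, Vec3 b)    { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* Signed altitude Ds of Pi relative to the plane through Ps with unit normal N.
   Positive means above the surface, negative means penetration. */
float plane_altitude(Vec3 Pi, Vec3 Ps, Vec3 N)
{
    return v_dot(v_sub(Pi, Ps), N);
}

/* Nearest point Pn on the plane: step back from Pi along N by the altitude. */
Vec3 plane_nearest_point(Vec3 Pi, Vec3 Ps, Vec3 N)
{
    return v_sub(Pi, v_scale(N, plane_altitude(Pi, Ps, N)));
}
```

For a ground plane with N = (0, 1, 0) and any Ps with y = 0, a point at (1, 3, 2) has altitude 3 and nearest point (1, 0, 2).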

Pinning Down the Collision Point


When you have one point on each side of a test surface, your vector must at some point intersect with it. Finding this intersection point Pc will tell you where your vector pierces the surface. If your collision detection code has not already provided you with the exact point, here is how you would find it. You know that the collision point must lie somewhere along the line. Knowing in advance that there is a collision makes it possible to take some shortcuts since we know there actually is a solution to the equation. If the test ray is parallel to the surface, the ratio cannot be calculated since it results in a divide by zero. We can take advantage of the calculation for Ds in finding the collision point Pc. Figure 2.2.2 shows the information needed for this calculation.


FIGURE 2.2.2 Finding the collision point Pc.

Since we know the collision is between the two points, we can find it by calculating how far it is along the line from Pi to Pf. This ratio can be written as:

$R = \frac{(P_i - P_s) \cdot N}{(P_i - P_f) \cdot N}$

Or, using our already computed Ds, it becomes:

$R = \frac{D_s}{(P_i - P_f) \cdot N}$

The two line segments are arranged to both point in the same direction relative to the surface normal, which guarantees that our ratio will be non-negative. Once we have this ratio, we can use it to multiply the length of the vector from Pi to Pf to tell how far from Pi the collision occurs. In the special case of R = 1, you can avoid the calculation since it results in the point Pf. For R = 0, the point is Pi. Otherwise, the following equation is used:

$P_c = P_i + R (P_f - P_i)$

Distance to the Collision Point


Although similar to Ds, this differs from the distance from the collision plane because the distance is calculated along the path of travel rather than along the surface normal. In the case of travelling near the surface but nearly parallel to it, your distance to collision will be very large when compared to your altitude.

This is the type of calculation you would want to use when calculating altitude for an aircraft, since you cannot guarantee the direction of a surface normal on the ground below. Rather than sending your actual velocity for a collision test, you would send your current position and a very large down vector, long enough to guarantee that it will intersect the ground. This works in the case of intersecting a small polygon that happens to be aligned nearly perpendicular to the test vector. In that case, the altitude relative to the collision plane Ds as calculated earlier would give a very small number.

FIGURE 2.2.3 Calculating distance to the collision point.

Once you have the actual collision point, it's very easy to calculate the distance using Euclid's equation to find how far it is from the starting point to the collision point. Figure 2.2.3 shows the elements required. The distance to the collision point, Dc, is the magnitude of the vector from our starting point Pi to the collision point Pc that was calculated earlier:

$D_c = |P_c - P_i|$

Another way to describe the magnitude of this vector is that it is the square root of the sum of the squares of the differences of each component of the vector. Most vector libraries include a function call to find the magnitude or length of a vector. Vector magnitudes are never negative.

Another possible shortcut can be used when you know the ratio R used to find the collision point as described in the previous section. The distance to the collision point is the length of the full line (which you may already have lying around) multiplied by the already computed ratio:

$D_c = R \, |P_f - P_i|$

Reflecting Off the Collision Plane


The usual result of a collision is to bounce. The interesting part is figuring out the direction and position once you have rebounded off a surface. Figure 2.2.4 shows the elements used in calculating the reflected vector. The first two cases will perfectly preserve the magnitude of the velocity. In both cases, the result of the bounce will be the same distance from the plane as Pf.

FIGURE 2.2.4 Calculating the reflected vector.

One of the simplest ways to visualize reflecting a point relative to a plane is to imagine a vector from the below-ground destination point back up to the surface along the surface normal. The new reflected location is found by continuing that line an equal distance to the other side of the plane. You obtain the new location by adding to the final point twice the distance from the final point to the surface. Reusing our equation to find the distance perpendicular to the plane, we come up with the following. The distance is multiplied by the surface normal to turn it back into a vector since you cannot add a simple scalar value such as Ds to a vector.

$P_r = P_f + 2N \left( (P_s - P_f) \cdot N \right)$
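A minimal sketch of the reflection step follows, using the term 2N((Ps - Pf) · N) shown in Figure 2.2.4. The Vec3 type and the function name reflect_point are assumptions for this illustration, and N is assumed to be unit length.

```c
typedef struct { float x, y, z; } Vec3;

static Vec3  v_add(Vec3 a, Vec3 b)    { Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z }; return r; }
static Vec3  v_sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3  v_scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }
static float v_dot(Vec3 a, Vec3 b)    { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* Mirror the below-surface end point Pf through the plane (Ps, N):
   Pr = Pf + 2 N ((Ps - Pf) . N). */
Vec3 reflect_point(Vec3 Pf, Vec3 Ps, Vec3 N)
{
    float depth = v_dot(v_sub(Ps, Pf), N);   /* how far Pf lies below the plane */
    return v_add(Pf, v_scale(N, 2.0f * depth));
}
```

Reflecting (4, -2, 0) through the plane y = 0 gives (4, 2, 0): the same distance from the plane, on the other side.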

The original and reflected vectors will have the same angle relative to the plane. Another way of looking at this is that if you normalize the vectors from your collision point going out to both your original start point and your reflected point, then find a dot product of each with your surface normal, they will be equal. Vectors are normalized by dividing the vector by its length or magnitude, so the statement about reflected vectors in the previous paragraph can be written as:

$\frac{P_i - P_c}{|P_i - P_c|} \cdot N = \frac{P_r - P_c}{|P_r - P_c|} \cdot N$

Any point on the plane could be substituted for Pc (Ps works, for instance) in the preceding equation and the same result would hold since all we are saying here is that the ends of the unit vectors are the same distance from the plane. A complication with reflections is that the newly determined end point needs to be tested all over again with your collision code to see if you have been pushed through some other surface. If you repeat the collision test, but with a vector from your collision point Pc to the newly reflected point Pr, you will get a possible new collision. You will need to repeat this process until no collision occurs. At each pass, your


vector will be smaller as it is chewed up by bouncing, as long as you do not try to bounce between two coincident planes.

There is one degenerate case you should also watch out for when chaining collisions together. Should you hit exactly at the intersection of two planes, your second test will be made with an initial point on the edge of the surface. This is easy to handle if you know about this problem and allow for it in advance by assuming anything exactly at the surface (or on the edge) has not yet collided, and is counted as above the surface for collision purposes. Collisions are reserved for penetration distances greater than zero.

Once you have completed your final reflection, a new velocity vector can be computed by normalizing the direction from the last collision to the final reflected location and multiplying it by the original velocity like this:

$V = \frac{(P_r - P_c) \, |P_i - P_f|}{|P_r - P_c|}$

Kickback Collision

Sometimes, rather than reflect off the collision plane, you want to kick the player back the way he or she came as shown in Figure 2.2.5. The calculations for this are simple once you have the collision point. Since this collision also preserves velocity, it is also perfectly elastic.

The point to which you are kicked back, Pk, is obtained by calculating the vector from your final point Pf back to your collision point Pc and adding it to the collision point.

$P_k = P_c + (P_c - P_f)$

We can combine terms to get:

$P_k = 2P_c - P_f$

FIGURE 2.2.5 Calculating a kickback vector.


You can run into the same problems with kickback collisions as with reflections where the destination point leads to an additional collision. However, there is an early way out of the loop for kickback collisions in some cases. If the collision point is more than halfway from the initial point to the final point, the resulting kickback point will occur in an area that has already been checked for collisions, so no additional check is necessary.
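The kickback point and the early-out test can be sketched as below. The Vec3 type and both function names are hypothetical, and R is assumed to be the ratio along the path from Pi to Pf at which the collision occurred.

```c
typedef struct { float x, y, z; } Vec3;

static Vec3 v_sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3 v_scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }

/* Pk = 2 Pc - Pf: send the leftover travel back along the incoming path. */
Vec3 kickback_point(Vec3 Pc, Vec3 Pf)
{
    return v_sub(v_scale(Pc, 2.0f), Pf);
}

/* When the hit happened more than halfway along the path (R > 0.5), the
   leftover distance is shorter than the distance already travelled, so Pk
   lands on the stretch between Pi and Pc that was already tested. */
int kickback_needs_recheck(float R)
{
    return R <= 0.5f;
}
```

With Pc = (2, 0, 0) and Pf = (4, -2, 0), the kickback point is (0, 2, 0).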

Collisions with Damping


Should you want to perform a collision with some sort of friction or damping, you will need to be careful of how you handle the vectors. You will need a scalar value S that will be used as a multiplier to change your velocity at each impact. It will typically range from zero (the object stops on impact) to one (a completely elastic collision preserving velocity). Energy can be injected into the system by increasing the scalar above one. A kickback collision is the same as using a scalar of negative one.

To properly handle nonelastic collisions, you must scale only the portion of the vector from the collision point Pc to the reflected point Pr as shown in Figure 2.2.6, since that is the only portion of the flight that will have been slowed due to the impact. The following equation relies heavily on the earlier equations to determine what your new slowed point, written here as Pd, would be.

$P_d = P_c + S (P_r - P_c)$

In coordination with this, you would need to multiply any separately stored velocity vector by the same scalar value, or your object will resume its full speed the next frame. In the case of a collision putting your point into another immediate collision as discussed earlier, this scale factor should be applied at each pass to simulate the damping effect of multiple bounces in a single frame.
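One way to read the damping rule is sketched below: only the leg from Pc to Pr is scaled by S, and any stored velocity is scaled by the same factor. The Vec3 type, the function names, and the formula Pc + S(Pr - Pc) are this sketch's interpretation of the description, not code taken from the gem.

```c
typedef struct { float x, y, z; } Vec3;

static Vec3 v_add(Vec3 a, Vec3 b)    { Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z }; return r; }
static Vec3 v_sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3 v_scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }

/* Scale only the post-impact leg of the move: Pc + S (Pr - Pc). */
Vec3 damped_point(Vec3 Pc, Vec3 Pr, float S)
{
    return v_add(Pc, v_scale(v_sub(Pr, Pc), S));
}

/* Apply the same scalar to the stored velocity so that the object
   does not resume full speed on the next frame. */
Vec3 damped_velocity(Vec3 V, float S)
{
    return v_scale(V, S);
}
```

With S = 1 the result is the fully reflected point Pr, with S = 0 the object stops at Pc, and when several bounces occur in one frame the function would be applied once per bounce.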

FIGURE 2.2.6 Calculating a damped reflection vector.

Interpolation Across a Line or Plane

An interesting side note about lines and planes is that a weighted average interpolation between any set of points occupies the space defined by those points. For instance, starting with a line, you can assign weights to the end points where the weights sum to one, and all possible resulting points are on the line defined by the points. Adding another point to build a plane extends the rule so the sum of the three weights must equal one in order for the weighted sum to remain on the plane defined by the three points. Additional points on the plane may be added, and additional dimensions may also be added should you have need for a point on an n-dimensional plane. This trick is related in a roundabout way to the reason we can often use Ps and Pc interchangeably in several of the previous equations. Either point is sufficient to fill the needs of the plane equation. It's also interesting to note that the individual weights don't need to be between zero and one. They just need to all sum up to the value one, which allows the resulting point to be outside the line segment or polygon defined by the points while still being on the extended line or plane.
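The weighted-average property is easy to check numerically. In this sketch the Vec3 type and the function name weighted_point3 are assumptions for illustration; the caller is responsible for making the three weights sum to one.

```c
typedef struct { float x, y, z; } Vec3;

/* Weighted combination of three points. If wa + wb + wc == 1 the result lies
   on the plane through A, B, and C, even when some weights are negative or
   greater than one (the point then falls outside the triangle). */
Vec3 weighted_point3(Vec3 A, float wa, Vec3 B, float wb, Vec3 C, float wc)
{
    Vec3 r;
    r.x = wa * A.x + wb * B.x + wc * C.x;
    r.y = wa * A.y + wb * B.y + wc * C.y;
    r.z = wa * A.z + wb * B.z + wc * C.z;
    return r;
}
```

For example, the points (1, 0, 0), (0, 1, 0), and (0, 0, 1) define the plane x + y + z = 1. Weights of 2, -0.5, and -0.5 give (2, -0.5, -0.5), which is well outside the triangle but still satisfies x + y + z = 1.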
Sphere-to-Plane Collision

Colliding a ball with a surface is a little bit more complex than colliding a point against the surface. One way to approach it is through ratios. If you draw a line from the start point Pi to the line-based collision point Pc of the vector against the plane, the ball center Pb will be somewhere on that line when the ball begins to intersect the plane. When the ball just touches the surface, we can compare the line from Pi to Pc to the line Pi to Pn to gain the information we need. If you project the first line onto the second, the ball center is off the surface by exactly the ball radius r on the line Pi to Pn. Since the length of that line is known to be Ds, we can get a ratio of how far the ball is along the line. This is similar to the way we used a ratio to find the surface collision point Pc. This ratio is the same when applied to the line from Pi to Pc, which leads to the equation:

$\frac{|P_c - P_b|}{|P_c - P_i|} = \frac{r}{D_s}$

The equation can be solved for the location of the ball Pb, resulting in:

$P_b = P_c - \frac{r}{D_s} (P_c - P_i)$


FIGURE 2.2.7 Colliding with a sphere.

Figure 2.2.7 shows the relation graphically, indicating where the sphere is in relation to the vectors. Care must be taken to notice that the ball does not actually reach Pc as the ball touches the surface unless the line from Pi to Pc is perpendicular to the surface. As the vector comes closer to being parallel to the plane, the ball will be farther from Pc when it touches the plane.
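The ratio argument for the sphere can be sketched as a single function. The Vec3 type and the name sphere_touch_center are assumptions for this illustration; the sketch also assumes Ds > r > 0, meaning the ball starts clear of the surface.

```c
typedef struct { float x, y, z; } Vec3;

static Vec3 v_sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3 v_scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }

/* Center of a ball of radius r at the moment it first touches the plane,
   given the start point Pi, the point-based collision point Pc, and the
   start altitude Ds: Pb = Pc - (r / Ds) (Pc - Pi). */
Vec3 sphere_touch_center(Vec3 Pi, Vec3 Pc, float Ds, float r)
{
    return v_sub(Pc, v_scale(v_sub(Pc, Pi), r / Ds));
}
```

Starting at (0, 2, 0) with Pc = (2, 0, 0) above the plane y = 0, a ball of radius 0.5 touches the surface when its center is at (1.5, 0.5, 0). Its altitude there equals the radius, and it is still short of Pc because the path is oblique.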

2.3
Fast, Robust Intersection of 3D Line Segments
Graham Rhodes, Applied Research Associates

The problem of determining the intersection of two line segments comes up from time to time in game development. For example, the line/line intersection problem can be beneficial in simple collision detection. Consider two objects in three-dimensional space that are moving in time. During a time step or animation frame, each object will move from one point to another along a linear path. The simplest check to see if the objects collide during the time step would be to see how close the two linear paths come to crossing, and if they are within a certain distance of each other (in other words, less than the sum of the radii of bounding spheres of the objects), then process a collision. Other common applications for line segment intersections include navigation and motion planning (for example, when combined with an AI system), map overlay creation, and terrain/visibility estimation.

This gem describes a robust, closed form solution for computing the intersection between two infinite lines or finite-length line segments in three-dimensional space, if an intersection exists. When no intersection exists, the algorithm produces the point along each line segment that is closest to the other line, and a vector between the two nearest points.

What Makes This Algorithm Robust?


The algorithm presented here is robust for a couple of reasons. First, it does not carry any special requirements (for example, the line segments must be coplanar). Second, it has relatively few instances of tolerance checks. The basic algorithm has only two tolerance checks, and these are required mathematically rather than by heuristics.

The Problem Statement


Given two line segments in three-dimensional space, one that spans between the points $A_1 = [A_{1x}\ A_{1y}\ A_{1z}]^T$ and $A_2 = [A_{2x}\ A_{2y}\ A_{2z}]^T$ and one that spans between the points $B_1 = [B_{1x}\ B_{1y}\ B_{1z}]^T$ and $B_2 = [B_{2x}\ B_{2y}\ B_{2z}]^T$, we would like to find the true point of intersection, $P = [P_x\ P_y\ P_z]^T$, between the two segments, if it exists. When no intersection exists, we would like to compromise and find the point on each segment that is nearest to the other segment. Figure 2.3.1 illustrates the geometry of this situation. The nearest points, labeled C and D respectively, can be used to find the shortest distance between the two segments. This gem focuses on finding the nearest points, which are identical to the true intersection point when an intersection exists.

FIGURE 2.3.1 Two line segments in three-dimensional space. A) An intersection exists. B) No intersection.
Observations

Before delving into how to solve the line intersection problem, it can be useful to make a few observations. What are the challenges to solving the problem correctly? Consider an arbitrary, infinite line in space. It is likely that the line will intersect an arbitrary plane (if the line is not parallel to the plane, then it intersects the plane); however, it is unlikely that the line will truly intersect another line (even if two three-dimensional lines are not parallel, they do not necessarily intersect). From this observation, we can see that no algorithm designed to find only true intersections will be robust, that is, capable of finding a result for an arbitrary pair of lines or line segments, since such an algorithm will fail most of the time. The need for a robust algorithm justifies the use of an algorithm that finds the nearest points between two lines within a real-time 3D application such as a game.

Since every student who has taken a basic planar geometry class has solved for the intersection of lines in a two-dimensional space, it is useful to consider the relationship between the three-dimensional line intersection problem and the two-dimensional intersection problem. In two-dimensional space, any two nonparallel lines truly intersect at one point. To visualize what happens in three-dimensional space, consider a plane that contains both defining points of line A, and the first defining point of line B. Line A lies within the plane, as does the first defining point of line B. Now suppose the two lines truly intersect. The point of intersection lies on the plane, since that point is contained on line A. The point of intersection also lies on line B, and so two points of line B lie within the plane. Since two points of line B lie in the plane, the entire line lies in the plane.

The important conclusion here is that whenever there is a true intersection of two lines, those two lines lie within a common plane. Thus, any time two three-dimensional lines have a true intersection, the problem is equivalent to a two-dimensional intersection problem in the plane that contains all four of the defining points.
Naive Solutions

A naive, and problematic, solution to the intersection problem is to project the two segments into one of the standard coordinate planes (XY, YZ, or XZ), and then solve the problem in the plane. In terms of implementation, the primary difficulty with this approach is selecting an appropriate plane to project into. If neither of the line segments is parallel to any of the coordinate planes, then the problem can be solved in any coordinate plane. However, an unacceptable amount of logic can be required when one or both segments are parallel to coordinate planes.

A variation on this approach, less naive but still problematic, is to form a plane equation from three of the four points, $A_1$, $A_2$, $B_1$, and $B_2$, project all four points into the plane, and solve the problem in the plane. In the rare case that there is a true intersection, this latter approach produces the correct result.

One key feature that is completely lacking from the basic two-dimensional projected intersection problem is the ability to give a direct indication as to whether a three-dimensional intersection exists. It also doesn't provide the three-dimensional nearest points. It is necessary to work backwards to produce this vital information.

The biggest problem with either variation on the projected solution arises when the two lines pass close to one another, but do not actually intersect. In this case, the solution obtained in any arbitrary projection plane will not necessarily be the correct pair of nearest points. The projection will often yield completely wrong results! To visualize this situation (which is difficult to illustrate on a printed page), consider the following thought experiment. There are two line segments floating in space. Segment A is defined by the points (0, 0, 0) and (1, 0, 0), and segment B is defined by (1, 0, 1) and (1, 1, 1).

When the lines are viewed from above, equivalent to projecting the lines into the XY plane, the two-dimensional intersection point is (1, 0, 0), and the three-dimensional nearest points are (1, 0, 0) and (1, 0, 1). These are the correct nearest points for the problem. However, if those two lines are viewed from different arbitrary angles, the two-dimensional intersection point will move to appear anywhere on the two line segments. Projecting the two-dimensional solution back onto the three-dimensional lines yields an infinite number of "nearest" point pairs, which is clearly incorrect. The test code provided on the companion CD-ROM is a useful tool to see this problem, as it allows you to rotate the view to see two line segments from different viewing angles, and displays the three-dimensional nearest points that you can compare to the intersection point seen in the viewport.

In the next section, I derive a closed-form solution to the calculation of points C and D that does not make any assumptions about where the two line segments lie in space. The solution does handle two special cases, but these cases are unavoidable even in the alternative approaches.

Derivation of Closed-Form Solution Equations


Calculating the Nearest Points on Two Infinite Lines

The equation of a line in three-dimensional space can be considered a vector function of a single scalar value, a parameter. To derive a closed-form solution to the nearest-point problem between two 3D lines, we first write the equation for an arbitrary point, $C = [C_x\ C_y\ C_z]^T$, located on the first line segment, as Equation 2.3.1.

$$C = A_1 + s\,\vec{L}_A, \quad \text{where } \vec{L}_A = A_2 - A_1 \tag{2.3.1}$$

Notice that Equation 2.3.1 basically says that the coordinates of any point on the first segment are equal to the coordinates of the first defining point plus an arbitrary scalar parameter s times a vector pointing along the line from the first defining point to the second defining point. If s is equal to zero, the coordinate is coincident with the first defining point, and if s is equal to 1, the coordinate is coincident with the second defining point. We can write a similar equation for an arbitrary point, $D = [D_x\ D_y\ D_z]^T$, located on the second line segment, as Equation 2.3.2:

$$D = B_1 + t\,\vec{L}_B, \quad \text{where } \vec{L}_B = B_2 - B_1 \tag{2.3.2}$$

Here, t is a second arbitrary scalar parameter, with the same physical meaning as s with respect to the second line segment. If the parameters s and t are allowed to be arbitrary, then we will be able to calculate points C and D as they apply to infinite lines rather than finite segments. For any point on a finite line segment, the parameters s and t will satisfy $0 \le s, t \le 1$. We'll allow s and t to float arbitrarily for now, and treat the finite-length segments later.

The two 3D line segments intersect if we can find values of s and t such that points C and D are coincident. For a general problem, there will rarely be an intersection, however, and we require a method for determining s and t that corresponds to the nearest points C and D. The remainder of the derivation shows how to solve for these values of s and t. First, subtract Equation 2.3.2 from Equation 2.3.1 to obtain the following equation for the vector between points C and D:

$$C - D = -\vec{AB} + s\,\vec{L}_A - t\,\vec{L}_B = [0\ 0\ 0]^T, \quad \text{where } \vec{AB} = B_1 - A_1 \tag{2.3.3}$$

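Here, since we would like for points C and D to be coincident, we set the vector between the points to be the zero vector. Equation 2.3.3 can then be represented by the following matrix equation:

$$\begin{bmatrix} L_{Ax} & -L_{Bx} \\ L_{Ay} & -L_{By} \\ L_{Az} & -L_{Bz} \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} AB_x \\ AB_y \\ AB_z \end{bmatrix} \tag{2.3.4}$$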

There are three rows in Equation 2.3.4, one for each coordinate direction, but only two unknowns, the scalar values s and t. This is a classic over-determined (over-constrained) system. The only way there can be an exact solution is if the three equations are consistent, that is, if appending the right side to the coefficient matrix as an extra column does not increase its rank (which is 2 for nonparallel, nondegenerate lines). In that case the three equations are equivalent to just two independent equations, leading to an exact solution for s and t. Geometrically, when there is an exact solution, the two lines have a true intersection and are coplanar. Thus, two arbitrary lines in three-dimensional space can only have a true intersection when the lines are coplanar.

The difference between the left side and right side of Equation 2.3.4 is equal to the vector representing the distance between C and D. It is also the error vector of Equation 2.3.4 for any arbitrary values of s and t. We determine the nearest points by minimizing the length of this vector over all possible values of s and t. The values of s and t that minimize the distance between C and D correspond to a linear least-squares solution to Equation 2.3.4. Geometrically, the least-squares solution produces the points C and D. When the segments are coplanar but not parallel, the algorithm will naturally produce the true intersection point. Equation 2.3.4 can be written in the form:

$$M\mathbf{x} = \mathbf{b}, \quad \text{where } \mathbf{x} = [s\ \ t]^T \tag{2.3.5}$$

One method for finding the least-squares solution to an over-determined system is to solve the normal equations instead of the original system [Golub96]. The normal equations approach is suitable for this problem, but can be problematic for general problems involving systems of linear equations. We generate the normal equations by premultiplying the left side and right side by the transpose of the coefficient matrix M. The normal equations for our problem are shown as Equation 2.3.6.

$$M^T M \mathbf{x} = M^T \mathbf{b}, \quad \text{where } M^T \text{ is the transpose of } M \tag{2.3.6}$$

Equation 2.3.6 has the desired property of reducing the system to the solution of a system of two equations, exactly the number needed to solve algebraically for values of s and t. Let's carry through the development of the normal equations for Equation 2.3.4. Expanding according to Equation 2.3.6, the normal equations are:

$$\begin{bmatrix} L_{Ax} & L_{Ay} & L_{Az} \\ -L_{Bx} & -L_{By} & -L_{Bz} \end{bmatrix} \begin{bmatrix} L_{Ax} & -L_{Bx} \\ L_{Ay} & -L_{By} \\ L_{Az} & -L_{Bz} \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} L_{Ax} & L_{Ay} & L_{Az} \\ -L_{Bx} & -L_{By} & -L_{Bz} \end{bmatrix} \begin{bmatrix} AB_x \\ AB_y \\ AB_z \end{bmatrix} \tag{2.3.7}$$

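Carrying through the matrix algebra:

$$\begin{bmatrix} \vec{L}_A \cdot \vec{L}_A & -\vec{L}_A \cdot \vec{L}_B \\ -\vec{L}_A \cdot \vec{L}_B & \vec{L}_B \cdot \vec{L}_B \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} \vec{L}_A \cdot \vec{AB} \\ -\vec{L}_B \cdot \vec{AB} \end{bmatrix} \tag{2.3.8}$$

Or, simplifying by defining a series of new scalar variables ($L_{11} = \vec{L}_A \cdot \vec{L}_A$, $L_{22} = \vec{L}_B \cdot \vec{L}_B$, $L_{12} = -\vec{L}_A \cdot \vec{L}_B$, $r_A = \vec{L}_A \cdot \vec{AB}$, and $r_B = -\vec{L}_B \cdot \vec{AB}$):

$$\begin{bmatrix} L_{11} & L_{12} \\ L_{12} & L_{22} \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} r_A \\ r_B \end{bmatrix} \tag{2.3.9}$$
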
This is a simple $2 \times 2$ system, and to complete this section we will solve it algebraically to form a closed-form solution for s and t. There are a number of ways to solve Equation 2.3.9, including Cramer's rule [O'Neil87] and Gaussian elimination [Golub96]. Cramer's rule is theoretically interesting, but expensive, requiring approximately $(n+1)!$ multiply and divide operations for a general problem of size $n \times n$. Gaussian elimination is less expensive, requiring $n^3/3$ multiply and divide operations. There are other approaches to solving systems of linear equations that are significantly more reliable and often faster for much larger systems, including advanced direct solution methods such as QR factorizations for moderate-sized systems, and iterative methods for very large and sparse systems.

I will derive the solution using Gaussian elimination, which is slightly less expensive than Cramer's rule for the $2 \times 2$ system. Here, we perform one row elimination step to yield an upper triangular system. The row elimination step is as follows. Modify row 2 of Equation 2.3.9 by taking the original row 2 and subtracting row 1 times $L_{12}/L_{11}$ to yield Equation 2.3.10.

$$\begin{bmatrix} L_{11} & L_{12} \\ L_{12} - L_{11}\dfrac{L_{12}}{L_{11}} & L_{22} - L_{12}\dfrac{L_{12}}{L_{11}} \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} r_A \\ r_B - r_A\dfrac{L_{12}}{L_{11}} \end{bmatrix} \tag{2.3.10}$$

Simplify Equation 2.3.10 and multiply the new row 2 by $L_{11}$ to yield the upper triangular system shown in Equation 2.3.11.

$$\begin{bmatrix} L_{11} & L_{12} \\ 0 & L_{11}L_{22} - L_{12}^2 \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} r_A \\ L_{11} r_B - L_{12} r_A \end{bmatrix} \tag{2.3.11}$$

Equation 2.3.11 immediately yields a solution for t,

$$t = \frac{L_{11} r_B - L_{12} r_A}{L_{11} L_{22} - L_{12}^2} \tag{2.3.12}$$

and then, for s,

$$s = \frac{r_A - L_{12}\, t}{L_{11}} \tag{2.3.13}$$

It is important to note that Equations 2.3.12 and 2.3.13 fail in certain degenerate cases, and it is these degenerate cases that require that we use tolerances in a limited way. Equation 2.3.13 will fail if line segment A has zero length, and Equation 2.3.12 will fail if either line segment has zero length or if the line segments are parallel. These situations lead to a divide-by-zero exception. I provide more discussion later in the section titled Special Cases.

In terms of computational expense for the $2 \times 2$ problem, the only difference between solving for s and t using Gaussian elimination and Cramer's rule, for this case, is that the computation of s requires one multiply, one divide, and one subtraction for Gaussian elimination, but four multiplies, one divide, and two subtractions for Cramer's rule.

To summarize from the derivation, given line segment A from point $A_1$ to $A_2$ and line segment B from point $B_1$ to $B_2$, define the following intermediate variables:

$$\vec{L}_A = A_2 - A_1; \quad \vec{L}_B = B_2 - B_1; \quad \vec{AB} = B_1 - A_1 \tag{2.3.14}$$

and

$$L_{11} = \vec{L}_A \cdot \vec{L}_A; \quad L_{22} = \vec{L}_B \cdot \vec{L}_B; \quad L_{12} = -\vec{L}_A \cdot \vec{L}_B; \quad r_A = \vec{L}_A \cdot \vec{AB}; \quad r_B = -\vec{L}_B \cdot \vec{AB} \tag{2.3.15}$$

Compute the parameters s and t that define the nearest points as,

$$t = \frac{L_{11} r_B - L_{12} r_A}{L_{11} L_{22} - L_{12}^2} \tag{2.3.16}$$

and

$$s = \frac{r_A - L_{12}\, t}{L_{11}} \tag{2.3.17}$$

The point where the first segment comes closest to the second segment is then given by:

$$C = A_1 + s\,\vec{L}_A \tag{2.3.18}$$

and the point where the second segment comes closest to the first segment is given by:

$$D = B_1 + t\,\vec{L}_B \tag{2.3.19}$$

We can consider a point located halfway between the two nearest points to be the single point in space that is "nearest" to both lines/segments as:

$$P = (C + D)/2 \tag{2.3.20}$$

Of course, when the lines do intersect, point P will be the intersection point.

Special Cases

When we talk about the nearest points of two infinite lines in space, there are only two possible special cases. The first case occurs when one or both lines are degenerate, defined by two points that are coincident in space. This occurs when point $A_1$ is coincident with $A_2$, or when $B_1$ is coincident with $B_2$. We'll call this the degenerate line special case. The second case occurs when the two lines are parallel, called the parallel line special case.

It is easy to relate the degenerate line special case to the equations developed previously. Note that variable $L_{11}$, defined in Equation 2.3.15, is equal to the square of the length of line segment A, and $L_{22}$ is equal to the square of the length of segment B. If either of these terms is zero, indicating that a line is degenerate, then the determinant of the matrix in Equation 2.3.9 is zero, and we cannot find a solution for s and t. Note that when either $L_{11}$ or $L_{22}$ is zero, then $L_{12}$ is also zero. One standard test to check and decide if line A is degenerate is the following,

    bool line_is_degenerate = (L11 < epsilon * epsilon);

Here, epsilon ($\varepsilon$) is a small number such as perhaps $10^{-6}$. It is wiser to choose a value for $\varepsilon$ such as $10^{-6}$ rather than a much smaller number such as machine epsilon.

When segments A and B are both degenerate, then point C can be selected to be equal to point $A_1$, and point D can be selected to be equal to point $B_1$. When segment A alone is degenerate, then point C is equal to $A_1$, and point D is found by computing the point on segment B that is nearest to point C. This involves computing a value for parameter t only, from Equation 2.3.21.

$$-\vec{L}_B\, t = \vec{AB} \tag{2.3.21}$$

Equation 2.3.21 is a simplification of Equation 2.3.4 for the case where segment A is degenerate, and again it requires that we find a least-squares solution. The least-squares solution, shown here using normal equations without derivation, is:

$$t = \frac{-\vec{L}_B \cdot \vec{AB}}{\vec{L}_B \cdot \vec{L}_B} = \frac{r_B}{L_{22}} \tag{2.3.22}$$

Point D can be calculated using Equation 2.3.2. When segment B alone is degenerate, then point D is set equal to $B_1$, and point C is found by computing the point on segment A that is nearest to point D. This involves computing a value for parameter s only, from Equation 2.3.23, which is analogous to Equation 2.3.21.

$$\vec{L}_A\, s = \vec{AB} \tag{2.3.23}$$

Solving for s yields:

$$s = \frac{r_A}{L_{11}} \tag{2.3.24}$$


Note that Equation 2.3.24 is identical to Equation 2.3.13 with t set equal to zero. Since t equals zero at point $B_1$, our derivation here is consistent with the derivation for nondegenerate lines. Certainly, a nice way to handle the cases where only one segment is degenerate is to write a single subroutine that is used both when segment A alone is degenerate and when B alone is degenerate. It is possible to do this using either Equation 2.3.22 or Equation 2.3.24, as long as the variables are treated properly. The implementation provided on the companion CD-ROM uses Equation 2.3.24 for both cases, with parameters passed in such that the degenerate line is always treated as segment B, and the nondegenerate line is always treated as segment A.

It is also easy to relate the parallel line special case to the equations developed previously, although it is not quite as obvious as the degenerate case. Here, we have to remember that $L_{12}$ is the negative dot product of the vectors $\vec{L}_A$ and $\vec{L}_B$, and when the lines are parallel, the magnitude of that dot product is equal to the length of $\vec{L}_A$ times the length of $\vec{L}_B$. The determinant of the matrix in Equation 2.3.9 is given by $L_{11}L_{22} - L_{12}^2$, and this is equal to zero when $L_{12}$ is equal in magnitude to the length of $\vec{L}_A$ times the length of $\vec{L}_B$. Thus, when the line segments are parallel, Equation 2.3.9 is singular and we cannot solve for s and t.

In the case of infinite parallel lines, every point on line A is equidistant from line B. If it is important to find the distance between lines A and B, simply choose C to be equal to $A_1$, and then use Equations 2.3.22 and 2.3.2 to find D. Then, the distance between C and D is the distance between the two lines. We'll look at how to handle finite-length segments in the next section.
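As an illustration of the special-case formulas, the sketch below computes the nearest point on an infinite line to a given point (the least-squares solution of Equations 2.3.22 and 2.3.24 written for an arbitrary line and point) and uses it to measure the distance between two parallel lines as described above. The `Vec3` type and both function names are hypothetical and are not taken from the CD-ROM code; the line passed in is assumed to be nondegenerate.

```cpp
#include <cmath>

// Hypothetical minimal 3D vector type, defined only for this sketch.
struct Vec3 { double x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(double k, Vec3 a) { return {k * a.x, k * a.y, k * a.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Nearest point to Q on the infinite line through P1 and P2.
// Assumption: P1 and P2 are distinct, so dot(L, L) is nonzero.
Vec3 nearestPointOnLine(Vec3 P1, Vec3 P2, Vec3 Q)
{
    const Vec3 L = P2 - P1;
    // Same form as Equation 2.3.24, with (Q - P1) playing the role of AB.
    const double param = dot(L, Q - P1) / dot(L, L);
    return P1 + param * L;
}

// Distance between two parallel infinite lines: choose C = A1, then find D
// on line B nearest to C (Equations 2.3.22 and 2.3.2).
double distanceBetweenParallelLines(Vec3 A1, Vec3 B1, Vec3 B2)
{
    const Vec3 C = A1;
    const Vec3 D = nearestPointOnLine(B1, B2, C);
    const Vec3 CD = D - C;
    return std::sqrt(dot(CD, CD));
}
```

Sharing one point-to-line routine between the "A degenerate", "B degenerate", and parallel cases follows the single-subroutine idea suggested in the text.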

Coding Efficiency

For coding efficiency, you should check first for degenerate lines, and then for parallel lines. This approach eliminates the need to calculate some of the convenience variables from Equations 2.3.14 and 2.3.15 when one or both of the lines are degenerate.
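One possible way to arrange these checks is sketched below. The enum, the function name, and in particular the relative tolerance used for the parallel test are assumptions made for this example rather than details of the CD-ROM implementation. Note that $L_{12}$ and the determinant are computed only after both degenerate checks have passed.

```cpp
// Hypothetical minimal 3D vector type, defined only for this sketch.
struct Vec3 { double x, y, z; };

inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

enum class LineCase { BothDegenerate, ADegenerate, BDegenerate, Parallel, General };

LineCase classifyLines(Vec3 A1, Vec3 A2, Vec3 B1, Vec3 B2, double epsilon)
{
    const Vec3 LA = A2 - A1;
    const Vec3 LB = B2 - B1;
    const double L11 = dot(LA, LA);  // squared length of segment A
    const double L22 = dot(LB, LB);  // squared length of segment B
    const double eps2 = epsilon * epsilon;

    // Degenerate checks come first; they need only L11 and L22.
    const bool aDegenerate = L11 < eps2;
    const bool bDegenerate = L22 < eps2;
    if (aDegenerate && bDegenerate) return LineCase::BothDegenerate;
    if (aDegenerate) return LineCase::ADegenerate;
    if (bDegenerate) return LineCase::BDegenerate;

    // Only now compute L12 and the determinant of Equation 2.3.9.
    const double L12 = -dot(LA, LB);
    const double det = L11 * L22 - L12 * L12;

    // Assumed tolerance: det equals |LA|^2 |LB|^2 sin^2(angle), so this
    // compares sin^2(angle) against epsilon^2.
    if (det < eps2 * L11 * L22) return LineCase::Parallel;
    return LineCase::General;
}
```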

Dealing with Finite Line Segments


The previous two sections treated infinite lines. This is useful; however, there are perhaps many more situations in game development when it is required to process finite line segments. So, how do we adjust the results shown previously to deal with finite-length line segments?

Line Segments that Are Not Parallel

If Equations 2.3.12 and 2.3.13 generate values of s and t that are both within the range [0,1], then we don't need to do anything at all, since the finite-length line segment results happen to be identical to the infinite line results. Whenever one or both of s and t are outside of [0,1], then we have to adjust the results. For nonparallel lines, there are two possibilities: 1) s or t is outside of [0,1] and the other is inside [0,1]; and 2) both s and t are outside of [0,1]. Figure 2.3.2 illustrates these two cases.

For the case when just one of s or t is outside of [0,1], as in Figure 2.3.2a, all we need to do is:

1. Clamp the out-of-range parameter to [0,1].
2. Compute the point on the line for the new parameter. This is the nearest point for the first segment.
3. Find the point on the other line that is nearest to the new point on the first line, with the nearest point calculation performed for a finite line segment. This is the nearest point for the second segment.

In the last step, just clamp the value from Equation 2.3.22 to [0,1] before calculating the point on the other segment.

For the case when both s and t are outside of [0,1], as in Figure 2.3.2b, the situation is slightly more complicated. The process is exactly the same except that we have to make a decision about which segment to use in the previous process. For example, if we selected line segment A in Figure 2.3.2b, step 2 would produce point $A_2$. Then, step 3 would produce point $B_1$, the nearest point on segment B to point $A_2$. The pair of points, $A_2$ and $B_1$, clearly are not the correct points to choose for C and D. Point $B_1$ is the correct choice for D, but there is a point on segment A that is much closer to segment B than $A_2$. In fact, the point generated by step 3 will always be the correct choice for either C or D. It is the point from step 2 that is incorrect. We can compute the other nearest point by just using the result from step 3. The process for both s and t outside of [0,1] then becomes:

1. Choose a segment and clamp its out-of-range parameter to [0,1].
2. Compute the point on the line for the new parameter. This is not guaranteed to be the nearest point for the first segment!
3. Find the point on the other line that is nearest to the new point on the first line, with the nearest point calculation performed for a finite line segment. This is the nearest point for the second line segment.
4. Find the point on the first line segment that is nearest to the point that resulted from step 3. This is the nearest point for the first line segment.

If we select segment B in Figure 2.3.2b as our initial segment to correct, we would immediately select point $B_1$, and step 3 would give the point between $A_1$ and $A_2$. In this case, step 4 is not required. The implementation provided here does not bother to check for this situation.

FIGURE 2.3.2 Finite-length line segments. A) Either s or t is outside of [0,1]. B) Both s and t are outside of [0,1].

Line Segments that Are Parallel

There are two basic possible scenarios when the two segments are parallel, both of which are illustrated in Figure 2.3.3. First, there might be a single unique pair of nearest points, shown in Figure 2.3.3a. This always occurs when the projections of the two segments onto a line parallel to both do not overlap. Second, there might be a locus of possible nearest point pairs, shown in Figure 2.3.3b. Here, we could choose the two nearest points to be any pair of nearest points between the two vertical gray lines. The implementation provided on the accompanying CD-ROM selects the nearest points for finite-length, overlapping parallel line segments to be halfway between the gray lines; that is, at the midpoint of the overlapping portion of each segment.

FIGURE 2.3.3 Parallel line segments. A) Unique nearest point pair. B) Locus of nearest point pairs.

It is important to note that when the two segments are parallel, or almost parallel, the nearest points computed by this algorithm will often move erratically as the lines are rotated slightly. The algorithm will not fail in this case, but the results can be confusing and problematic, as the nearest points jump back and forth between the ends of the segments. This is illustrated in Figure 2.3.4. Shown in Figure 2.3.4a, the nearest points will stay at the far left until the lines become exactly parallel, at which point the nearest points will jump to the middle of the overlap section. Then, as the lines continue to rotate past parallel, the nearest points will jump to the far right, shown in Figure 2.3.4b.

This behavior may be problematic in some game applications. It is possible to treat the behavior by using a different approach to selecting the nearest point when lines are parallel or near parallel. For example, you could implement a rule that arbitrarily selects the point nearest $A_1$ as the nearest point on segment A when the segments are parallel within, say, 5 degrees of each other. To avoid the erratic behavior at the 5-degree boundary, you would need to blend this arbitrary nearest point with an algorithmically generated nearest point between, say, 5 and 10 degrees, with the arbitrary solution being 100% at 5 degrees and 0% at 10 degrees. This solution will increase the expense of the algorithm. There are certainly other approaches, including ones that may be simpler, cheaper, and more reliable. The implementation provided on the companion CD-ROM does not attempt to manage this behavior.


FIGURE 2.3.4 Erratic movement of nearest points for nearly parallel line segments. A) Nearest points at the left. B) Nearest points at the right.

Implementation Description
The implementation includes four C-language functions, contained in the files lineintersect_utils.h and lineintersect_utils.cpp. The primary interface is the function IntersectLineSegments, which takes parameters defining the two line segments, and returns points C, D, and P, as well as a vector between points C and D. The function also takes a parameter indicating whether you want the line segments to be treated as infinite lines, and a tolerance parameter to be used to check the degenerate and parallel line special cases. The vector between C and D can be used outside of the implementation to determine a distance between the lines. It is important to note that the vector is not necessarily normal to either of the line segments if the lines are finite. If the lines are infinite and at least one is not degenerate, the vector will be normal to the nondegenerate line(s).

The supporting functions are as follows:

FindNearestPointOnLineSegment calculates the point on a line segment that is nearest to a given point in three-dimensional space.

FindNearestPointOfParallelLineSegments calculates representative (and possibly unique) values for C and D for the case of parallel lines/segments.

AdjustNearestPoints adjusts the values of C and D from an infinite line solution to a finite-length line segment solution.

The code is documented with references to the text. A test program is also provided, called line_intersection_demo. The demo requires that you link to the GLUT library for OpenGL. Project files are present for Microsoft Visual C++ 6.0 for Windows. It should not be too difficult to port this to other systems that support OpenGL and GLUT.
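As a small usage illustration, the vector between C and D can feed the simple bounding-sphere test described at the start of this gem: a collision is processed when the two linear paths come closer than the sum of the sphere radii. The function below is a hypothetical helper, not part of the CD-ROM code, and it assumes the caller already has the vector between the two nearest points. Like the check described in the introduction, it ignores when each object reaches its nearest point during the time step.

```cpp
// Hypothetical minimal 3D vector type, defined only for this sketch.
struct Vec3 { double x, y, z; };

// CD is the vector between the nearest points C and D of the two paths.
// Comparing squared lengths avoids a square root.
bool pathsComeWithinRadii(Vec3 CD, double radiusA, double radiusB)
{
    const double distanceSquared = CD.x * CD.x + CD.y * CD.y + CD.z * CD.z;
    const double reach = radiusA + radiusB;
    return distanceSquared < reach * reach;
}
```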

Opportunities to Optimize
The implementation source code was written carefully, but without any attempt to optimize for a particular processor or instruction set. There are a number of opportunities in any code to optimize the implementation for a given platform. In this case, perhaps the biggest opportunity is in the area of vectorization. There are numerous operations in this code that require a multiply or addition/subtraction operation on all three elements of a vector. These are prime opportunities to vectorize. Additionally, if you have an instruction set that supports high-level operations such as dot products, take advantage when evaluating Equation 2.3.15, for example. To truly maximize the performance, I strongly recommend that you use a professional code profiling utility to identify bottlenecks and opportunities for your target platform(s).

The text presented here and the implementation provided on the accompanying CD-ROM are rigorous, and treat every conceivable situation. The code is generally efficient, but in the case where the infinite lines intersect outside of the range of the finite segments (in other words, one or both of s and t are outside of [0,1]), the true nearest points are not necessarily cheap to compute. In fact, the nearest point problem we've solved here is a minimization problem, and as is the case in general, the cost increases when constraints are applied to minimization problems.

Beyond processor/platform-specific optimizations, it is certainly possible to remove parts of the implementation that are not required for your application. For example, if you do not need to treat finite-length segments, remove everything that deals with finite-length segments. Just have the main function return a bool that is true when the nearest point is found between the finite segment endpoints, and false when the nearest point is found outside the finite segment endpoints.

Conclusions
The algorithm discussed here is rigorous and capable of handling any line intersection problem without failing. Depending on your particular use of line intersections, you may need to adjust the algorithm; for example, to manage the idiosyncrasies that arise when two finite segments are nearly parallel, or to remove the processing of finite segments when you only deal with infinite lines. I sincerely hope that some of you will benefit from this formal discussion of line and line segment intersections, along with ready-to-use source code.

References
[Golub96] Golub, Gene H., and Charles F. Van Loan, Matrix Computations, Third Edition, The Johns Hopkins University Press, 1996.
[O'Neil87] O'Neil, Peter V., Advanced Engineering Mathematics, Second Edition, Wadsworth Publishing Company, 1987.

2.4
Inverse Trajectory Determination
Aaron Nicholls, Microsoft
aaron [email protected]

A problem frequently faced in the development of games is that of calculating trajectories. In the most common case, we have a velocity and a direction for a projectile, and need to determine the location at a given time, and whether the projectile has collided with any other entities. This is a simple iterative problem, but it is not all that is required for most games. In many cases, we also need to solve the inverse of this problem; namely, given a number of constants (gravity, starting position, intended destination), we must calculate the proper yaw, pitch, and/or initial velocity to propel the projectile between the two points. In addition, once we have a solution for this problem, we can use this as a framework for solving more complex variants of the same problem. This gem expects that the reader is familiar with fundamental 2D/3D transformations, basic integral calculus, and trigonometry.

Simplifying the Problem at Hand

There are several ways to simplify the problem, and we can begin by reducing a three-dimensional problem to a two-dimensional one. Given an initial velocity and direction for a projectile, if the only acting force is gravity (which can usually be assumed to be constant), the trajectory of the projectile will be parabolic and planar. Therefore, by transforming this planar trajectory into two dimensions (x and y), we can simplify the problem significantly. In addition, by translating the starting point to the origin, we can remove the initial x and y values from most of the equations, focusing on the destination coordinates. A sample trajectory, rotated into the xy plane and translated to the origin, is shown in Figure 2.4.1.

In addition, we need to determine exactly what the problem is that we wish to solve. In this case, our variables are initial velocity, angle of elevation, and distance in x and y between the points. In the case where we know three of the four values (and thus have one unknown), our goal is to produce an equation that defines the value of the unknown in terms of the three known values.

FIGURE 2.4.1 Trajectory between two points in two dimensions. (Figure labels: Source; Destination (x, y); $v_x = v_i \cos\theta$.)

However, it is very common to have to deal with multiple unknowns. In that case, the best solution is typically to get rid of some of the variables by setting their values to constants. For instance, we often know the locations of the two points, but need to provide an initial velocity and angle of elevation. In this case, we can eliminate initial velocity as a variable by setting it to the maximum possible velocity $v_{max}$. By doing so, we only have one unknown, and we simply need to determine the angle of elevation $\theta$ in terms of $v_i$, x, and y. This technique and guidelines for using it are discussed in further detail later in this gem, under Solving for Multiple Variables.

Defining Position and Velocity as a Function of Time

Now that we have reduced the problem to two dimensions, we can identify the velocity and acceleration working in each dimension. Starting with initial velocity $v_i$, angle of elevation $\theta$, and gravity g, we can express initial velocity along the x and y axes as follows:

$$v_{xi} = v_i \cos\theta$$

$$v_{yi} = v_i \sin\theta$$

Since the only force acting upon this system is gravity, we can assume that horizontal velocity ($v_x$) stays constant, while gravity is acting upon vertical velocity ($v_y$). The two can be expressed as follows:

$$v_x = v_i \cos\theta \tag{2.4.1}$$

$$v_y = v_i \sin\theta - g t \tag{2.4.2}$$

Next, we integrate the velocity equations to determine the position at a given time (assuming the origin as the starting point).


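$$x = \int v_i \cos\theta \, dt \;\rightarrow\; x = v_i t \cos\theta \tag{2.4.3}$$

$$y = \int (v_i \sin\theta - g t) \, dt \;\rightarrow\; y = v_i t \sin\theta - \tfrac{1}{2} g t^2 \tag{2.4.4}$$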
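As a minimal sketch of Equations 2.4.3 and 2.4.4, the forward problem can be written as a single function. The `Vec2` type and the name `PositionAtTime` are illustrative assumptions, not part of the gem's code; the launch point is assumed to be the origin and `g` is a positive constant.

```cpp
#include <cmath>

// Hypothetical 2D point type used only for this sketch.
struct Vec2 { double x, y; };

// Position at time t of a projectile launched from the origin with speed vi
// at elevation theta (radians), under constant gravity g (positive).
Vec2 PositionAtTime(double vi, double theta, double g, double t)
{
    Vec2 p;
    p.x = vi * t * std::cos(theta);                    // Equation 2.4.3
    p.y = vi * t * std::sin(theta) - 0.5 * g * t * t;  // Equation 2.4.4
    return p;
}
```

Stepping `t` forward frame by frame and testing each returned point against the world gives the simple iterative solution mentioned at the start of the gem.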

A Special Case: Both Points at the Same Elevation
Before tackling the general case of this problem, let's examine a simpler case, which will give us insight into solving the more general problem. One of the common special cases is that in which both the start and end points have the same y value. An example of this might be a game where the ground is flat, and a cannon on the ground is firing at a ground target. In this case, we know that y, the vertical displacement between the two points, is zero. Therefore, we can simplify the vertical position equation by setting y = 0. This allows us to solve Equation 2.4.4 for time t, initial velocity $v_i$, or angle of elevation $\theta$ as follows:

$$y = v_i t \sin\theta - \tfrac{1}{2} g t^2 = 0$$
$$\rightarrow t = \frac{2 v_i \sin\theta}{g}$$
$$\rightarrow v_i = \frac{g t}{2 \sin\theta}$$
$$\rightarrow \theta = \sin^{-1}\left(\frac{g t}{2 v_i}\right)$$

In addition, this leads to a simplified formula for calculating x for this special case:

$$x = v_i \left(\frac{2 v_i \sin\theta}{g}\right) \cos\theta$$
$$\rightarrow x = \frac{2 v_i^2 \sin\theta \cos\theta}{g}$$

Using the trigonometric identity $2 \sin\theta \cos\theta = \sin 2\theta$, we can simplify further as follows:

$$x = \frac{v_i^2 \sin 2\theta}{g} \tag{2.4.5}$$

In addition, in the previous case where a ground-based cannon is firing at ground targets on flat terrain, this equation can be used to determine the maximum horizontal range of a cannon at angle of elevation $\theta$, given maximum projectile velocity $v_{max}$:

$$\text{Range} = \frac{v_{max}^2 \sin 2\theta}{g} \tag{2.4.6}$$
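The flat-ground case can be sketched in code as follows. Both function names are assumptions made for illustration. `FlatGroundAngle` simply inverts Equation 2.4.5 and returns the flatter of the two possible angles; it assumes a target at non-negative x.

```cpp
#include <cmath>

// Equation 2.4.5: horizontal distance covered on flat ground.
double FlatGroundRange(double vi, double theta, double g)
{
    return vi * vi * std::sin(2.0 * theta) / g;
}

// Inverse of Equation 2.4.5. Writes the flatter firing angle (radians) to
// thetaOut and returns true, or returns false if x is out of range.
bool FlatGroundAngle(double vi, double x, double g, double& thetaOut)
{
    if (vi <= 0.0 || g <= 0.0 || x < 0.0)
        return false;
    const double s = g * x / (vi * vi);   // this is sin(2 * theta)
    if (s > 1.0)
        return false;                     // target is beyond maximum range
    thetaOut = 0.5 * std::asin(s);
    return true;
}
```

Because $\sin 2\theta$ peaks at $\theta = \pi/4$, the range test `s > 1.0` is equivalent to asking whether the target lies beyond the 45-degree shot.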
Solving for Angle of Elevation
Now that we have defined the equations that describe the projectile's movement and solved for a special case, we can continue to the more general case. First, we will analyze the case in which both points may not be at the same altitude, and we must determine the angle of elevation or velocity required to propel a projectile between the two points. Since we have expressed x and y in terms of t, we can begin by removing t from the equations and defining x and y in terms of each other.

$$x = v_i t \cos\theta \;\rightarrow\; t = \frac{x}{v_i \cos\theta}$$

Next, we replace t with $x / (v_i \cos\theta)$ in the equation for y to remove t from the equation.

$$y = v_i t \sin\theta - \tfrac{1}{2} g t^2$$
$$\rightarrow y = x \tan\theta - \frac{g x^2}{2 v_i^2 \cos^2\theta}$$

We then use the trigonometric identity $1/\cos^2\theta = \tan^2\theta + 1$ to reduce further.

$$y = x \tan\theta - \frac{g x^2 (\tan^2\theta + 1)}{2 v_i^2}$$
$$\rightarrow \frac{g x^2}{2 v_i^2} \tan^2\theta - x \tan\theta + \frac{g x^2}{2 v_i^2} + y = 0 \tag{2.4.7}$$

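As odd as this final equation may look, it serves a purpose: this version of the equation fits into the quadratic formula as follows:

$$\tan\theta = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

where

$$a = \frac{g x^2}{2 v_i^2}, \qquad b = -x, \qquad c = \frac{g x^2}{2 v_i^2} + y$$

Plugging the preceding values of a, b, and c into the quadratic formula and solving for $\theta$, we obtain the following:

$$\theta = \tan^{-1}\left(\frac{x \pm \sqrt{x^2 - 4\left(\frac{g x^2}{2 v_i^2}\right)\left(\frac{g x^2}{2 v_i^2} + y\right)}}{2\left(\frac{g x^2}{2 v_i^2}\right)}\right)$$

$$\rightarrow \theta = \tan^{-1}\left(\frac{v_i^2 \pm \sqrt{v_i^4 - g\,(g x^2 + 2 y v_i^2)}}{g x}\right) \tag{2.4.8}$$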

The quadratic form provides us with a way to solve for $\theta$, given a known initial velocity $v_i$, horizontal displacement x, and vertical displacement y. If $(b^2 - 4ac)$ is positive, we have two possible solutions; if it is zero, there is exactly one; and if it is negative, there are no solutions for the given parameters. In addition, if the initial velocity is zero, we know that the trajectory is entirely vertical, so $\theta$ is irrelevant. When dealing with two trajectories, it is important to remember that the flatter trajectory will yield a faster route to the target, and is thereby preferable in most cases. If both angles are between $-\pi/2$ and $\pi/2$, the angle closer to zero will yield the flatter trajectory for a given $v_i$. A case with two valid angles of elevation to reach a given target is shown in Figure 2.4.2. Here, Trajectory 2 is the fastest.

FIGURE 2.4.2 Two angle-of-elevation solutions for a given $v_i$. (Labels: Trajectory 1, Trajectory 2, Source, Destination (x, y).)
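The following is one possible sketch of Equation 2.4.8 in code. The function name and its return convention are assumptions for illustration; it expects the target to be in front of the launcher (x > 0) and returns the flatter angle separately so that the caller can prefer it.

```cpp
#include <cmath>

// Solves Equation 2.4.8 for the elevation angle(s), in radians, that carry a
// projectile from the origin to (x, y) at launch speed vi under gravity g.
// Returns the number of solutions (0, 1, or 2). When solutions exist,
// 'flat' receives the lower (faster) trajectory and 'steep' the higher one.
int SolveLaunchAngles(double vi, double x, double y, double g,
                      double& flat, double& steep)
{
    if (vi <= 0.0 || x <= 0.0 || g <= 0.0)
        return 0;                                   // outside this sketch's assumptions
    const double v2 = vi * vi;
    const double disc = v2 * v2 - g * (g * x * x + 2.0 * y * v2);
    if (disc < 0.0)
        return 0;                                   // target cannot be reached
    const double root = std::sqrt(disc);
    flat  = std::atan((v2 - root) / (g * x));
    steep = std::atan((v2 + root) / (g * x));
    return (root == 0.0) ? 1 : 2;
}
```

For example, with vi = 10, g = 10, and a flat-ground target at x = 5, the two results are 15 and 75 degrees, which agrees with Equation 2.4.5.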

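Solving for Initial Velocity
Now that we have the problem solved for the case where $\theta$ is unknown, we can change Equation 2.4.7 slightly to solve for initial velocity $v_i$, given a known angle of elevation $\theta$, horizontal displacement x, and vertical displacement y as follows:

$$\frac{g x^2}{2 v_i^2} \tan^2\theta - x \tan\theta + \frac{g x^2}{2 v_i^2} + y = 0$$
$$\rightarrow \frac{g x^2}{2 v_i^2} \tan^2\theta + \frac{g x^2}{2 v_i^2} = x \tan\theta - y$$
$$\rightarrow \frac{g x^2}{2 v_i^2} (\tan^2\theta + 1) = x \tan\theta - y$$

We then multiply both sides by $v_i^2 / (x \tan\theta - y)$, thereby isolating initial velocity.

$$v_i^2 = \frac{g x^2 (\tan^2\theta + 1)}{2 (x \tan\theta - y)}$$

Solving for $v_i$, we get the following:

$$v_i = x \sqrt{\frac{g (\tan^2\theta + 1)}{2 (x \tan\theta - y)}} \tag{2.4.9}$$

Again, we can choose to use the trigonometric identity $1/\cos^2\theta = \tan^2\theta + 1$ to simplify the square root.

$$v_i = \frac{x}{\cos\theta} \sqrt{\frac{g}{2 (x \tan\theta - y)}} \tag{2.4.10}$$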
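A hedged sketch of Equation 2.4.9 follows. The name `SolveLaunchSpeed` is an assumption, and the function is only meant for angles strictly between $-\pi/2$ and $\pi/2$ with a target at x > 0; the vertical case needs separate handling.

```cpp
#include <cmath>

// Equation 2.4.9: launch speed needed to hit (x, y) from the origin when
// firing at elevation theta (radians) under gravity g.
// Returns false when no speed works, which happens when the firing line
// does not pass above the target (x * tan(theta) <= y).
bool SolveLaunchSpeed(double theta, double x, double y, double g, double& viOut)
{
    const double tanTheta = std::tan(theta);        // computed once, used twice
    const double denom = 2.0 * (x * tanTheta - y);
    if (x <= 0.0 || g <= 0.0 || denom <= 0.0)
        return false;
    viOut = x * std::sqrt(g * (tanTheta * tanTheta + 1.0) / denom);
    return true;
}
```

The `denom <= 0.0` test is the code form of the "no solution" case: the slope of the initial trajectory must exceed the slope of the line to the target.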
Again, since we are dealing with a square root, there are some cases that have no solution. An example would be when the slope of the initial trajectory is less than the slope to the target. One special case is where $\theta = \pi/2$ (straight upward), since there can be two solutions.

Calculating Maximum Height for a Trajectory
Solving for peak height of a trajectory is straightforward: the vertical peak is defined as the point where vertical velocity $v_y = 0$, given $\theta > 0$. Therefore, we simply solve the vertical velocity equation as follows:

$$v_y(t) = v_i \sin\theta - g t = 0$$

Solving for t, we get the following:

$$t = \frac{v_i \sin\theta}{g}$$

Now, to determine the maximum altitude, we substitute the preceding value for t in the vertical position equation as follows:

$$y_{peak} = \frac{v_i^2 \sin^2\theta}{g} - \frac{g}{2} \left(\frac{v_i \sin\theta}{g}\right)^2$$
$$\rightarrow y_{peak} = \frac{v_i^2 \sin^2\theta}{2 g} \tag{2.4.11}$$
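Equation 2.4.11 translates into a few lines of code. The sketch below (the name `PeakHeight` is an assumption) also covers a level or downward launch by returning zero, since in that case the highest point of the flight is the launch point itself.

```cpp
#include <cmath>

// Peak height above the launch point for launch speed vi, elevation theta
// (radians), and gravity g (positive).
double PeakHeight(double vi, double theta, double g)
{
    const double vy = vi * std::sin(theta);   // initial vertical velocity
    if (vy <= 0.0)
        return 0.0;                           // level or downward launch: peak is the start
    return vy * vy / (2.0 * g);               // Equation 2.4.11
}
```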

As mentioned previously, Equation 2.4.11 depends on $\theta > 0$. If the angle of elevation $\theta$ is negative (pointing downward), the vertical peak will be at y = 0, since the projectile's initial downward velocity is only increased by gravity. This is somewhat of a special case, since vertical velocity is not necessarily zero at the vertical peak in this case.

Calculating Flight Time
In order to determine time to destination, we can simply rewrite the horizontal position from Equation 2.4.3 in terms of t.

$$x(t) = v_i t \cos\theta \;\rightarrow\; t = \frac{x}{v_i \cos\theta}$$

However, in the case where $v_i = 0$ or $\cos\theta = 0$, t is undefined if expressed in terms of x. In this case, x will always be zero, and no solution exists if the two points are not at the same x value. In implementation, these boundary cases are worth testing, since a mistake here can cause an engine to crash or behave erratically. To solve for t when $v_i = 0$ or $\cos\theta = 0$, we can use the vertical position equation from Equation 2.4.4 instead.

$$y(t) = v_i t \sin\theta - \tfrac{1}{2} g t^2$$

If $v_i = 0$, we can use the following equation to express t in terms of y and g.

$$y = -\tfrac{1}{2} g t^2 \;\rightarrow\; t = \sqrt{\frac{-2 y}{g}}$$

However, if $\cos\theta = 0$ and $v_i > 0$, there can be one or two solutions (the latter happens only if $\theta > 0$, since $v_i > 0$ in practice). In addition, we know that if $\cos\theta = 0$, then $\sin\theta = \pm 1$. This reduces the problem further, but we still need to express this in terms of t as follows:

$$y = v_i t \sin\theta - \tfrac{1}{2} g t^2$$
$$\rightarrow \tfrac{1}{2} g t^2 - v_i t \sin\theta + y = 0 \tag{2.4.12}$$
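One way to organize the flight-time cases above is sketched below. The function name, the epsilon value, and the choice to return the earliest positive root of Equation 2.4.12 are all assumptions of this sketch. In the ordinary case the function trusts that the trajectory really passes through (x, y).

```cpp
#include <cmath>

// Time for a projectile launched from the origin to reach (x, y).
// Returns false when no non-negative time exists.
bool FlightTime(double vi, double theta, double x, double y, double g, double& tOut)
{
    const double eps = 1e-9;
    const double vx = vi * std::cos(theta);
    if (std::fabs(vx) > eps) {
        tOut = x / vx;                      // Equation 2.4.3 solved for t
        return tOut >= 0.0;
    }
    // Boundary case (vi == 0 or cos(theta) == 0): no horizontal motion.
    if (std::fabs(x) > eps || g <= 0.0)
        return false;
    // Equation 2.4.12: 0.5*g*t^2 - vy*t + y = 0, solved by the quadratic formula.
    const double vy = vi * std::sin(theta);
    const double disc = vy * vy - 2.0 * g * y;
    if (disc < 0.0)
        return false;                       // the projectile never reaches height y
    const double root = std::sqrt(disc);
    const double t1 = (vy - root) / g;      // earlier crossing
    const double t2 = (vy + root) / g;      // later crossing
    tOut = (t1 > 0.0) ? t1 : t2;
    return tOut >= 0.0;
}
```

With vi = 0 and y = -5 under g = 10, the quadratic collapses to the $t = \sqrt{-2y/g}$ form given above and yields t = 1.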

This is a quadratic in t that can be solved with the quadratic formula; working through it is left to the reader.

Solving for Multiple Variables
As mentioned near the start of this gem, it is very common that two or more values are unknown or need to be determined, usually $\theta$ and $v_i$ (since both points are usually known). In multivariate cases, the set of possible solutions expands greatly, so in order to solve the problem, the fastest approach is to eliminate some of the unknowns. In the most common case, we are given two points and a maximum initial velocity $v_{max}$, and need to solve for both $v_i$ and $\theta$. When reducing variables in order to simplify to a single-variable problem, it is important to reduce in a manner that does not overly restrict possible solutions. In the previous case in which both $\theta$ and $v_i$ are unknown, restricting $\theta$ greatly reduces the number of solutions, and is undesirable. On the other hand, setting $v_i = v_{max}$ and varying $\theta$ preserves a larger set of landing points. This same logic can be extended to other forms of this problem, although there is not space to elaborate further within the scope of this gem.

Optimizing Implementation
When implementing the previous equations in code, there are a few optimizations that can make a substantial difference in performance. This is because trigonometric functions have a very high overhead on most systems.
Avoid Oversimplification

When deriving mathematical calculations, there is a tendency to reduce formulae to their simplest mathematical form, rather than the simplest or most optimal algorithm. For instance, in solving for initial velocity $v_i$, we came across Equations 2.4.9 and 2.4.10 as follows:

$$v_i = x \sqrt{\frac{g (\tan^2\theta + 1)}{2 (x \tan\theta - y)}} \qquad v_i = \frac{x}{\cos\theta} \sqrt{\frac{g}{2 (x \tan\theta - y)}}$$

The tendency from a mathematical point of view would be to prefer the latter form, since it reduces the equation; however, in implementation, it is more efficient to precalculate $\tan\theta$ and use it twice in the first equation, rather than calculating both $\tan\theta$ and $\cos\theta$ as is done in the latter formula. In addition, even if we choose to use the second equation (and not simplify to terms of $\tan\theta$), leaving $\cos\theta$ outside of the square root bracket means that two divisions need to be done: one inside the bracket and one outside. To optimize, one can either place the $\cos\theta$ inside the divisor within the bracket as $\cos^2\theta$, or multiply x by $1/\cos\theta$.

Reduce Trigonometric Functions to Simpler Calculations

Rather than using the provided functions for sin, cos, and tan, it can be much more efficient to use pregenerated lookup tables or to take advantage of known relations between other variables. For instance, to calculate $\tan\theta$, you can simply divide the initial value of $v_y$ by $v_x$, since they are defined in terms of $\sin\theta$ and $\cos\theta$, respectively, and are likely precomputed. There is additional room for optimization; the purpose here is simply to alert the reader to the high computational cost involved with trigonometric calculation and the importance of optimization.

Summary
Efficient trajectory production can enhance perceived AI quality and engine performance. Although the derivation can be math intensive, the resulting equations are relatively simple and easy to understand. In addition, once the process involved in deriving and simplifying the previous formulae is understood, it is easy to apply that knowledge to more complicated situations, such as moving targets, nonvertical acceleration, and other related problems.

2.5
The Parallel Transport Frame
Carl Dougan
[email protected]

Many tasks in computer games require generating a suitable orientation as an object moves through space. Let's say you need to orient a camera flying along a looping path. You'd probably want the camera to turn with the path and point along the direction of travel. When the path loops, the orientation of the camera should change appropriately, to follow the loop. You wouldn't want it to suddenly flip or twist, but turn only to match the changes of the path. The parallel transport frame method can help provide this "steady" orientation.

You can also use this technique in the generation of geometry. A common operation in 3D modeling is lofting, where a 2D shape is extruded along a path curve, and multiple sections made from the shape are connected together to produce 3D geometry. If the 2D shape was a circle, the resulting 3D model would be a tube, centered on the path curve. The same criteria apply in calculating the orientation of the shape as did with the camera: the orientation should "follow" the path and shouldn't be subject to unnecessary twist.

The parallel transport method gets its stability by incrementally rotating a coordinate system (the frame) as it is translated along a curve. This "memory" of the previous frame's orientation is what allows the elimination of unnecessary twist; only the minimal amount of rotation needed to stay parallel to the curve is applied at each step. Unfortunately, in order to calculate the frame at the end of a curve, you need to iterate a frame along the path, all the way from the start, rotating it at each step. Two other commonly used methods of curve framing are the Frenet Frame and the Fixed Up method [Eberly01], which can be calculated analytically at any point on the path, in one calculation. They have other caveats, however, which will be described later.

The Technique
A relatively simple numerical technique can be used to calculate the parallel transport frame [Glassner90]. You take an arbitrary initial frame, translate it along the curve, and at each iteration, rotate it to stay as "parallel" to the curve as possible.

Given:
a curve C
an existing frame F1 at t-1
a tangent T1 at t-1 (the first derivative or velocity of C at t-1)
a tangent T2 at t

a new frame F2 at the next time t can be calculated as follows:
F2's position is the value of C at t.
F2's orientation can be found by rotating F1 about an axis A by an angle $\alpha$, where

$$A = T_1 \times T_2 \qquad \alpha = \cos^{-1}\left(\frac{T_1 \cdot T_2}{|T_1|\,|T_2|}\right)$$

If the tangents are parallel (that is, if $T_1 \times T_2$ is zero), the rotation can be skipped (Figure 2.5.1).
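The step described above can be sketched as follows. This is an illustrative version only: the `Vec3` and `Frame` types and the function names are assumptions, and the rotation is applied to each frame axis with Rodrigues' rotation formula rather than with the quaternion approach the gem goes on to recommend.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 Cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}
static double Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static double Length(const Vec3& a) { return std::sqrt(Dot(a, a)); }

// Rotates v about the unit-length axis k by 'angle' radians
// (Rodrigues' rotation formula).
static Vec3 RotateAboutAxis(const Vec3& v, const Vec3& k, double angle)
{
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    const Vec3 kxv = Cross(k, v);
    const double kdv = Dot(k, v) * (1.0 - c);
    return { v.x * c + kxv.x * s + k.x * kdv,
             v.y * c + kxv.y * s + k.y * kdv,
             v.z * c + kxv.z * s + k.z * kdv };
}

// Hypothetical frame type: three orthonormal axes.
struct Frame { Vec3 right, up, forward; };

// One parallel transport step: rotates frame f1 by the rotation that takes
// tangent t1 to tangent t2. Parallel tangents leave the frame unchanged
// (exactly opposite tangents are also skipped, since the axis is undefined).
Frame TransportFrame(const Frame& f1, const Vec3& t1, const Vec3& t2)
{
    const Vec3 a = Cross(t1, t2);                  // A = T1 x T2
    const double aLen = Length(a);
    const double denom = Length(t1) * Length(t2);
    if (denom <= 0.0 || aLen <= 1e-12 * denom)
        return f1;
    double cosAlpha = Dot(t1, t2) / denom;
    if (cosAlpha > 1.0) cosAlpha = 1.0;            // guard acos against rounding
    if (cosAlpha < -1.0) cosAlpha = -1.0;
    const double alpha = std::acos(cosAlpha);
    const Vec3 axis = { a.x / aLen, a.y / aLen, a.z / aLen };
    Frame f2;
    f2.right   = RotateAboutAxis(f1.right, axis, alpha);
    f2.up      = RotateAboutAxis(f1.up, axis, alpha);
    f2.forward = RotateAboutAxis(f1.forward, axis, alpha);
    return f2;
}
```

Calling `TransportFrame` once per sample while walking the curve from its start yields the sequence of frames; the position of each frame is simply the curve evaluated at that sample.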

FIGURE 2.5.1 The frame at t-1 is rotated about A by $\alpha$ to calculate the frame at t.

The initial frame is arbitrary. You can calculate an initial frame in which an axis lies along the tangent with the Fixed Up or the Frenet Frame method. In some cases, you may find it desirable to use parallel transport to generate frames at a coarse sampling along the curve, and then achieve smooth rotation between the sample frames by using quaternion interpolation. Using quaternions is desirable anyway, since there is an efficient method of generating a quaternion from a rotation axis and angle [Eberly01]. You can use the angle and axis shown previously to generate a rotation quaternion, and then multiply it with the previous frame's quaternion to perform the rotation.

Moving Objects
You can orient a moving object with a single parallel transport rotation each time the object is moved, presumably once per frame. We need three pieces of information: the velocity of the object at the current and previous locations, and the orientation at the previous location. The velocities correspond to the tangents T1 and T2 shown previously.

For some tasks, the parallel transport frame may be too "stable." For example, an aircraft flying an S-shaped path on the horizontal plane would never bank. To achieve realistic-looking simulation of flight, you may need to use a different solution, such as simulating the physics of motion. Craig Reynolds describes a relatively simple, and thus fast, technique for orienting flocking "boids" that includes banking [Reynolds99]. Reynolds' technique is similar to parallel transport in that it also relies on "memory" of the previous frame.

Comparison
The details here show how the parallel transport method we have looked at so far compares with the Frenet Frame and Fixed Up methods of curve framing.

The Frenet Frame
The Frenet Frame is built from three orthogonal axes:
The tangent of the curve
The cross-product of the tangent and the second derivative
Another vector generated from the cross-product of the prior two vectors

The Frenet Frame is problematic for the uses already discussed because it cannot be calculated when the second derivative is zero. This occurs at points of inflection and on straight sections of the curve [Hanson95]. Clearly, not being able to calculate a frame on a straight section is a big problem for our purposes. In addition, the frame may spin, due to changes in the second derivative. In the case of an S-shaped curve, for example, the second derivative points into the curves, flipping sides on the upper and lower halves. The resulting Frenet Frames on the S-shaped curve will flip in consequence. Figure 2.5.2 shows what this means graphically; instead of continuous geometry, we have a discontinuity where the second derivative switches sides. If this was a flock of birds, they would suddenly flip upside down at that point.

FIGURE 2.5.2 Second derivative on an S-shaped curve, and Frenet Frame generated tube from the same curve.
The Fixed Up Method

In the case of the Fixed Up method, the tangent T and an arbitrary vector V (the Fixed Up vector) are used to generate three axes of the resulting frame, the direction D, up U, and right R vectors [Eberly01].

$$D = \frac{T}{|T|} \qquad R = \frac{D \times V}{|D \times V|} \qquad U = R \times D$$


A problem with the Fixed Up method occurs when the tangent and the arbitrary vector chosen are parallel or close to parallel. When T and V are parallel, the cross-product of D and V is zero and the frame cannot be built. Even if they are only very close to parallel, the twist of the resulting vector relative to the tangent will vary greatly with small changes in T, twisting the resulting frame. This isn't a problem if you can constrain the path, which may be possible for some tasks, like building the geometry of freeways, but may not be for others, like building the geometry of a roller coaster. Figure 2.5.3 shows a comparison of a tube generated using parallel transport with one using the Fixed Up method. In the upper and lower sections of the curve, the cross-product of the tangent and the Fixed Up vector is coming out of the page. In the middle section, it is going into the page. The abrupt flip causes the visible twist in the generated geometry.
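A minimal sketch of the Fixed Up construction, including a guard for the degenerate case just described, might look like this. The types, the function name, and the parallelism threshold are assumptions chosen for illustration.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 Cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}
static double Length(const Vec3& a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }
static Vec3 Scale(const Vec3& a, double s) { return { a.x * s, a.y * s, a.z * s }; }

// Builds direction d, right r, and up u from tangent t and fixed up vector v.
// Returns false when the frame cannot be built reliably, that is, when the
// tangent is zero or t and v are parallel or nearly so.
bool FixedUpFrame(const Vec3& t, const Vec3& v, Vec3& d, Vec3& r, Vec3& u)
{
    const double tLen = Length(t);
    if (tLen <= 0.0)
        return false;
    d = Scale(t, 1.0 / tLen);                 // D = T / |T|
    const Vec3 dxv = Cross(d, v);
    const double len = Length(dxv);
    if (len <= 1e-6 * Length(v))              // threshold is an arbitrary choice
        return false;
    r = Scale(dxv, 1.0 / len);                // R = (D x V) / |D x V|
    u = Cross(r, d);                          // U = R x D
    return true;
}
```

A caller that receives `false` could fall back to the previous frame or to a parallel transport step for that sample.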

FIGURE 2.5.3 Comparison of Fixed Up and parallel transport. (Panels: Fixed Up, Parallel Transport.)

Conclusion
For unconstrained paths (for example, flying missiles or looping tracks), parallel transport is one method that you can use to keep the tracks from twisting and the missiles from flipping.

References
[Glassner90] Bloomenthal, Jules, "Calculation of Reference Frames Along a Space Curve," Graphics Gems, Academic Press, 1990: pp. 567-571.
[Eberly01] Eberly, David H., 3D Game Engine Design, Academic Press, 2001.
[Hanson95] Hanson, Andrew J., and Ma, Hui, Parallel Transport Approach to Curve Framing, Department of Computer Science, Indiana University, 1995.
[Reynolds99] Reynolds, Craig, "Steering Behaviors for Autonomous Characters," available online at www.red3d.com/cwr/steer/gdc99/index.html.

2.6
Smooth C2 Quaternion-based Flythrough Paths
Alex Vlachos, ATI Research; and John Isidore
[email protected] and [email protected]

In this gem, we describe a method for smoothly interpolating a camera's position and orientation to produce a flythrough with C2 continuity. We draw on several known methods and provide a C++ class that implements the methods described here.

Introduction
Smoothly interpolating the positions of a flythrough path can easily be achieved by applying a natural cubic spline to the sample points. The orientations, on the other hand, require a little more attention. We describe a method for converting a quaternion in S3 space (points on the unit hypersphere) into R4 space (points in 4D space) [Johnstone99]. Once the quaternion is in R4 space, any 4D spline can be applied to the transformed data. The resulting interpolated points can then be transformed back into S3 space and used as a quaternion. In addition, a technique called selective negation is described to preprocess the quaternions in a way that produces the shortest rotation path between sample point orientations. Camera cuts (moving a camera to a new location) are achieved by introducing phantom points around the camera cut similar to the way an open spline is padded. These additional points are needed to pad the spline to produce smooth results near the cut point. The code provided describes cut points as part of a single fly path and simplifies the overall code. Internally to the C++ class, the individual cut segments are treated as separate splines without the overhead of creating a spline for each segment.

Position Interpolation
Let's now discuss position interpolation.

Sample Points
There are two common ways to specify sample points. The first is to have each segment between control points represent a constant time (for example, each control point represents one second of time). The second is to use the control points only to define the shape of the camera path, and to have the camera move at a constant speed along this path. The code provided with this gem assumes a constant time between control points, although this code could easily be modified for the constant speed technique.

Natural Cubic Spline
A natural cubic spline is chosen due to the high degree of continuity it provides, namely C2. However, it's important to note that any spline may be used in place of the natural cubic spline. Code for implementing this spline is widely available, including in Numerical Recipes in C [Press97]. The sample code provided is modeled after this.

A natural cubic spline is an interpolating curve that is a mathematical representation of the original drafting spline. One important characteristic of this spline is its lack of local control. This means that if any single control point is moved, the entire spline is affected. This isn't necessarily a disadvantage; in fact, this functionality may be desirable. As you begin to use this spline, you'll see the advantages it has in smoothing out the camera movement when you sample the spline at a higher frequency.

It is important to differentiate between open and closed splines. In the case of a closed spline, the spline is specified such that the last point is the same as the first point. This is done to treat the camera path as a closed loop. To work around any possible discontinuities in the spline at the loop point, simply replicate the last four points of the spline to the beginning of the array, and the first four sample points to the end of the array. In practice, we've found that using four points was sufficient to eliminate any visual artifacts. This replication eliminates the need for modulus arithmetic and also simplifies our preprocessing of the camera path. This is even more important when dealing with orientations using the selective negation method as described later (Figure 2.6.1).

FIGURE 2.6.1 Replicating points for a closed spline.

FIGURE 2.6.2 Creating phantom points for an open spline.

In contrast, an open spline has a different beginning and end point. In order to sample the spline, you need to pad the spline with several "phantom" points at both the beginning and end of the open spline (Figure 2.6.2). A constant velocity is assumed for the phantom points before and after the open spline path. At the beginning of the spline in Figure 2.6.2, the vector $V_0 = P_1 - P_0$ is subtracted from $P_0$ to get the resulting point $P_{-1}$. Similarly, $V_0$ is subtracted from $P_{-1}$ to create $P_{-2}$, and so on. The trailing phantom points are calculated in a similar way.
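The phantom-point padding for positions can be sketched as below. This is not the class shipped with the gem; `PadOpenSpline`, `Vec3`, and the `padCount` parameter are names assumed for this example.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Returns a copy of 'pts' with 'padCount' constant-velocity phantom points
// added before the first point and after the last point.
std::vector<Vec3> PadOpenSpline(const std::vector<Vec3>& pts, std::size_t padCount)
{
    if (pts.size() < 2)
        return pts;                                     // no velocity to extrapolate
    const Vec3& p0 = pts[0];
    const Vec3& p1 = pts[1];
    const Vec3& pn = pts[pts.size() - 1];
    const Vec3& pm = pts[pts.size() - 2];
    const Vec3 v0 = { p1.x - p0.x, p1.y - p0.y, p1.z - p0.z };   // V0 = P1 - P0
    const Vec3 vn = { pn.x - pm.x, pn.y - pm.y, pn.z - pm.z };   // velocity at the end

    std::vector<Vec3> out;
    out.reserve(pts.size() + 2 * padCount);
    for (std::size_t i = padCount; i >= 1; --i) {       // P(-i) = P0 - i * V0
        const double k = static_cast<double>(i);
        out.push_back(Vec3{ p0.x - k * v0.x, p0.y - k * v0.y, p0.z - k * v0.z });
    }
    out.insert(out.end(), pts.begin(), pts.end());
    for (std::size_t i = 1; i <= padCount; ++i) {       // trailing phantom points
        const double k = static_cast<double>(i);
        out.push_back(Vec3{ pn.x + k * vn.x, pn.y + k * vn.y, pn.z + k * vn.z });
    }
    return out;
}
```

The padded array can then be handed to whichever spline routine is in use, with sampling restricted to the original (unpadded) range.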

Orientation Interpolation
Sample Points

Unit quaternions are used as the orientation data at the sample points. Quaternions can be very useful for numerous applications. The beauty of quaternions is that, for rotations, they take the form of a normalized 4-element vector (later referred to as a 3-element vector and a scalar component). This is exactly enough information to represent an axis of rotation and an angle of rotation around that axis [GPG1]. Quaternions give us everything we need to represent a rotation and nothing more.

For orientation, however, there is an ambiguity in using quaternions. Orientation can be thought of as a rotation from a base orientation. When using quaternions, there are two possible rotations that will bring you to the same orientation. Suppose there is a counterclockwise rotation $\theta$ about an axis w that gives you the desired orientation. A rotation by $360° - \theta$ about the axis $-w$ also results in the same orientation. When converted into a quaternion representation, the second quaternion is simply the negation of the first one.

Direction of Rotation and Selective Negation


When performing quaternion interpolation, there is one small nuance that needs to be considered. When representing an orientation, either a quaternion or its negation will suffice. However, when interpolating orientations (for example, performing a rotation), the positive and negative quaternions result in vastly different rotations and consequently different camera paths. If the desired result is to perform the smallest possible rotation between each pair of two orientations, you can preprocess the quaternions to achieve this.

Taking the dot product of two quaternions gives the cosine of half the angle of rotation between them. If this quantity is negative, the angle of rotation between the two quaternions is greater than 180 degrees. In this case, negating one of the orientation quaternions makes the angle between the two quaternions less than 180 degrees. In terms of interpolation, this makes the orientation spline path always perform the shortest rotation between the orientation key frames. We call this process selective negation.

The technique of selectively negating orientation quaternions can be incorporated as a preprocessing step for a camera flythrough path. For the preprocessing step, traverse the flythrough path from start to end, and for each quaternion $q_i$ on the path, negate it if the dot product between it and its predecessor is negative (in other words, if $(q_i \cdot q_{i-1}) < 0$). Using selective negation as a preprocessing step makes spline interpolation much more efficient by not requiring the selective negation math for every sample.

To preprocess a closed spline path, it is necessary to replicate the first four points of the spline path and append them to the end of the path prior to the selective negation. Note that the replicated points may have different signs than the original points.

When dealing with an open spline, you need to create phantom quaternions (corresponding to the phantom control points) to pad the spline. The concept is similar in that you want to linearly interpolate the difference between the two quaternions closest to the beginning or end of the path. However, linearly interpolating quaternions doesn't suffice. Instead, we use the spherical linear interpolation (slerp) algorithm. Given quaternions $q_0$ and $q_1$, we need to generate four phantom quaternions $q_{-1}$, $q_{-2}$, and so on to pad the beginning of an open spline. We use the slerp function to slerp from $q_1$ to $q_0$ with a slerp value of 2.0. This effectively gives us a linear change in rotation at our phantom points.

Once we have preprocessed our entire list of orientation quaternions for interpolation, it is straightforward to perform smooth spline-based quaternion interpolation techniques.
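The preprocessing pass and the slerp-based extrapolation can be sketched as follows. The `Quat` layout and function names are assumptions for illustration, and this `Slerp` is a plain textbook form that accepts parameter values outside [0, 1]; it expects unit quaternions that have already been selectively negated.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Quat { double x, y, z, w; };   // vector part (x, y, z), scalar part w

static double Dot(const Quat& a, const Quat& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Selective negation: flips any quaternion whose dot product with its
// predecessor is negative, so consecutive keys take the shortest rotation.
void SelectiveNegation(std::vector<Quat>& q)
{
    for (std::size_t i = 1; i < q.size(); ++i) {
        if (Dot(q[i], q[i - 1]) < 0.0) {
            q[i].x = -q[i].x; q[i].y = -q[i].y;
            q[i].z = -q[i].z; q[i].w = -q[i].w;
        }
    }
}

// Spherical linear interpolation from a (t = 0) to b (t = 1). Values of t
// beyond 1 continue the same rotation past b.
Quat Slerp(const Quat& a, const Quat& b, double t)
{
    double c = Dot(a, b);
    if (c > 1.0) c = 1.0;              // guard acos against rounding
    if (c < -1.0) c = -1.0;
    const double omega = std::acos(c);
    const double s = std::sin(omega);
    if (s < 1e-9)
        return a;                      // a and b (nearly) coincide
    const double wa = std::sin((1.0 - t) * omega) / s;
    const double wb = std::sin(t * omega) / s;
    return { wa * a.x + wb * b.x, wa * a.y + wb * b.y,
             wa * a.z + wb * b.z, wa * a.w + wb * b.w };
}
```

With these helpers, the first phantom quaternion at the start of an open path would be `Slerp(q1, q0, 2.0)`, following the description above.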

Spline Interpolation for Quaternions


As seen for positional interpolation, splines can be used to give us much smoother interpolation than linear interpolation can. However, spline interpolation for quaternions is not so straightforward, and there are several techniques that can be used. One technique simply interpolates the raw quaternion values, and then renormalizes the resulting quaternion. However, this technique does not result in a smooth path and produces bizarre changes in angular velocity. Another idea is to use techniques based on the logarithms of quaternions. SQUAD (spherical quadrangle interpolation) [Shoemake91] is an example of this. A performance limitation is incurred when using these techniques because they require transcendental functions (sin, cos, log, pow, and so on). Other techniques involve blending between great 2-spheres lying on the unit quaternion hypersphere [Kim95], or involve some sort of iterative numeric technique [Neilson92]. While many of these techniques provide decent results, most of them do not provide C2 continuity or are computationally prohibitive to use, especially when many flythrough paths are used (for game characters or projectiles, as an example).

However, there is a technique for quaternion spline interpolation that gives very good results and obeys derivative continuity requirements. This uses an invertible rational mapping M [Johnstone99] between the unit quaternion 4-sphere ($S^3$) and another four-dimensional space ($R^4$). In the following equations, a, b, and c are the components of the vector portion of the quaternion, and s is the scalar portion.

The transformation $M^{-1}$ from $S^3 \rightarrow R^4$ is:

$$x = \frac{a}{\sqrt{2(1-s)}} \qquad y = \frac{b}{\sqrt{2(1-s)}} \qquad z = \frac{c}{\sqrt{2(1-s)}} \qquad w = \frac{1-s}{\sqrt{2(1-s)}}$$

The transformation M from $R^4 \rightarrow S^3$ is:

$$a = \frac{2xw}{x^2 + y^2 + z^2 + w^2} \qquad b = \frac{2yw}{x^2 + y^2 + z^2 + w^2} \qquad c = \frac{2zw}{x^2 + y^2 + z^2 + w^2} \qquad s = \frac{x^2 + y^2 + z^2 - w^2}{x^2 + y^2 + z^2 + w^2}$$
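The two mappings can be sketched as a pair of small functions. The type and function names below are assumptions. The scalar component is computed as $s = (x^2 + y^2 + z^2 - w^2)/(x^2 + y^2 + z^2 + w^2)$, which is what a round trip through $M^{-1}$ requires. Note that $M^{-1}$ is undefined for the identity quaternion (s = 1), so the sketch reports failure there rather than dividing by zero.

```cpp
#include <cmath>

struct Quat { double a, b, c, s; };   // vector part (a, b, c), scalar part s
struct Vec4 { double x, y, z, w; };

// M^-1: unit quaternion (S3) to a point in R4. Returns false at s == 1.
bool QuatToR4(const Quat& q, Vec4& out)
{
    const double d = 2.0 * (1.0 - q.s);
    if (d <= 1e-12)
        return false;
    const double inv = 1.0 / std::sqrt(d);   // one square root per control quaternion
    out.x = q.a * inv;
    out.y = q.b * inv;
    out.z = q.c * inv;
    out.w = (1.0 - q.s) * inv;
    return true;
}

// M: point in R4 back to a unit quaternion. Returns false at the origin.
bool R4ToQuat(const Vec4& p, Quat& out)
{
    const double xyz = p.x * p.x + p.y * p.y + p.z * p.z;
    const double ww = p.w * p.w;
    const double n = xyz + ww;
    if (n <= 0.0)
        return false;
    out.a = 2.0 * p.x * p.w / n;
    out.b = 2.0 * p.y * p.w / n;
    out.c = 2.0 * p.z * p.w / n;
    out.s = (xyz - ww) / n;
    return true;
}
```

Because M divides by the squared length of its input, it returns a unit quaternion for any nonzero point in $R^4$, so interpolated spline samples need no separate renormalization step.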

Using this for quaternion spline interpolation is straightforward. First, selective negation should be applied to the control quaternions to ensure the shortest possible rotation between control points. After this, apply $M^{-1}$ to all the control quaternions to get their resulting values in $R^4$. This can be done as a preprocessing step in the flythrough-path building or loading stage of a program. This way, the square root does