Fuzzy Logic on the GPU in CUDA
In this tutorial, we will implement fuzzy logic on the GPU (an NVIDIA 8800). You should read the following paper AND the CUDA API documentation before trying to understand all of the logic and programming in this tutorial.

- D. Anderson and S. Coupland, Parallelisation of Fuzzy Inference on a Graphics Processor Unit Using the Compute Unified Device Architecture, UKCI 2008, the 8th Annual Workshop on Computational Intelligence

This tutorial uses a set of rules with two antecedents and one consequent. I recommend jumping into the functions below first to get an understanding of the algorithm, then looking at the variable explanations to understand which live on the CPU versus the GPU, which memory types they use, and so on. I have simplified the code somewhat for tutorial purposes; generalizations and further optimizations are discussed in the paper and in the code below. I feel like a broken record saying this, but: make sure your system is set up properly, pay attention to what else is running on your system when you calculate timings, use high-precision timers appropriate to your operating environment, and keep your drivers and CUDA libraries up to date. To get performance you need to pay attention to all of these little things; one error and you can lose all of your performance. Oh, the art of GPU programming!


These are the includes that we need.

#include <stdlib.h>            
#include <stdio.h> 
#include <string.h> 
#include <math.h> 
#include <windows.h> 
#include <winbase.h> 
#include <time.h> 
#include <cutil.h>             

We define the maximum number of threads that can exist in a block; there are 4 parameters in a trapezoid; the MYMAX and MYMIN macros compute the maximum and minimum; and TRAPCALC calculates the membership of an input in a trapezoid. Rval is the number of rules. Keep this value a power of two for the moment; we can change it to a non-power-of-two value later. SAMPLE_RATE is the number of samples taken along the consequent domain. The higher the sampling, the more accurate the results, and the faster the CUDA application is relative to the CPU. This is mostly because the largest amount of parallelism is available at the steps that involve calculations on elements sampled over the consequent domain, since those are all independent calculations.

#define MAX_NUMBER_OF_THREADS_PER_BLOCK 512	//Max number of threads per a CUDA block
#define NUM_PARAMS 4				//Number of parameters, i.e. 4 for a trap

#define MYMAX(a,b) ( (a) > (b) ? (a) : (b) )
#define MYMIN(a,b) ( (a) < (b) ? (a) : (b) )

#define TRAPCALC(x,a,b,c,d) ( MYMAX( MYMIN( MYMIN( ((x)-(a))/((b)-(a)) , ((d)-(x))/((d)-(c)) ) , 1.0 ) , 0.0  ) )

#define Rval 64					//Number of rules
#define SAMPLE_RATE 512				//Discrete consequent domain sample rate

Now, we need to deal with how inputs (2 values for rules with 2 antecedents) are passed down to the GPU. CUDA dictates that we need a descriptor for the texture. Constant memory is read-only, which makes it a poor candidate here because we will need to change the input over time. Global memory is read/write, but it is the slowest! Texture memory is read-only with respect to access on the GPU, but we can keep updating it from the host, i.e. we can change the program input over time; it will never be written to on the GPU, and the GPU program only needs read access to it. Texture memory is also faster than global memory. Therefore, we create the descriptor FIS_Inputs_tex_channelDesc for our inputs, and we make a cudaArray (an array on the GPU) called CUDA_FIS_Inputs.

cudaChannelFormatDesc FIS_Inputs_tex_channelDesc;
cudaArray* CUDA_FIS_Inputs;

OK, now we need to set up the rules and the fuzzy sets that will be used in fuzzy inference. We are going to use R rules (the Rval parameter). Each rule has 2 antecedents and one consequent, so we need to keep track of 3 fuzzy sets per rule. Each rule is therefore 12 numbers: 3 sets of 4 values. We need to create the texture memory and descriptor for this memory region.

cudaChannelFormatDesc FIS_RulesAndFuzzySets_tex_channelDesc;
cudaArray* CUDA_FIS_RulesAndFuzzySets;

CPU-side memory (we will analyze these below in their respective function for context).

float FIS_Inputs[2];
float FIS_RulesAndFuzzySets[Rval][NUM_PARAMS * 2];
float FIS_ConsequentFuzzySets[Rval][NUM_PARAMS];
float FIS_DiscreteConsequents[Rval][SAMPLE_RATE];

The first 4 variables below are GPU memory (the data lives on the GPU; on the CPU side these are just pointers, i.e. no CPU memory yet!). The second 4 are CPU-side variables that will hold the results from the GPU program.

float *Input1_Result1;
float *Input1_Result2;
float *Input2_Result1;
float *Input2_Result2;
float *CPU_Input1_Result1;
float *CPU_Input1_Result2;
float *CPU_Input2_Result1;
float *CPU_Input2_Result2;

Here are the discrete consequents (this is GPU memory). This will be a discrete representation of the fuzzy consequent sets, one for each rule respectively.

cudaChannelFormatDesc FIS_DiscreteConsequents_tex_channelDesc;
cudaArray* CUDA_FIS_DiscreteConsequents;

These are the CUDA texture IDs for the GPU. We need one for each texture (i.e. inputs, rules and fuzzy sets, and the discrete consequent).

texture<float, 2, cudaReadModeElementType> FIS_Inputs_tex;	
texture<float, 2, cudaReadModeElementType> FIS_RulesAndFuzzySets_tex;
texture<float, 2, cudaReadModeElementType> FIS_DiscreteConsequents_tex;

Result of implication (on the GPU).

float* CUDA_FIS_ImplicationResults;
float* CUDA_FIS_ImplicationResults2;

Result of implication (on the CPU). We will read the results from the GPU into this CPU memory.

float CPU_CUDA_FIS_ImplicationResults[ SAMPLE_RATE * Rval ];

This is the input data and a function that you can write to fill it. I will not use the array or this function, but this shows you where they would go. The calls to this function and variable are commented out in main where you would want to use them. We are going to evaluate a single FIS for one input so you can verify the calculation in main.

//This is just the inputs that we will pass down to the FIS
//float inputs[1000][2];
//You can write your own loader, this is just a hard coded set of inputs
//int populateInputs()
//{
//...

FINALLY, a function! OK, this is how we set up the GPU and its memory. We call CUT_DEVICE_INIT() to get the device set up, then we allocate global GPU memory (the cudaMalloc calls). Next, we set up descriptors and CUDA arrays for our texture data (cudaCreateChannelDesc and cudaMallocArray). After that, we bind the array memory to the textures.

//CUDA initialization
__host__ void InitCUDA(float **I1R1, float **I1R2, float **I2R1, float **I2R2, int R)
{
	
	////////////////////////////////////////////////////////////////////////////////////////
	
	//Makes sure there is a device and set it up
    	CUT_DEVICE_INIT();
    
	////////////////////////////////////////////////////////////////////////////////////////
    
	//allocate global GPU memory 
	
    	//create the space down on the device (global memory)
    	//Linear memory is allocated using cudaMalloc() or cudaMallocPitch() and freed using cudaFree()
    	CUDA_SAFE_CALL( cudaMalloc( (void**) &(*I1R1), R * sizeof(float) ) );   
    	//CUDA_SAFE_CALL( cudaMalloc( (void**) &(*I1R2), R * sizeof(float) ) );   
    	//CUDA_SAFE_CALL( cudaMalloc( (void**) &(*I2R1), R * sizeof(float) ) );   
    	//CUDA_SAFE_CALL( cudaMalloc( (void**) &(*I2R2), R * sizeof(float) ) );   
    	CUDA_SAFE_CALL( cudaMalloc( (void**) &CUDA_FIS_ImplicationResults, R * SAMPLE_RATE * sizeof(float) ) );            
    	CUDA_SAFE_CALL( cudaMalloc( (void**) &CUDA_FIS_ImplicationResults2, R * SAMPLE_RATE * sizeof(float) ) );   

	////////////////////////////////////////////////////////////////////////////////////////            
            
    	//Allocate the space on the GPU for the input data and setup the texture references 

    	//cudaMallocArray() requires a format description created using cudaCreateChannelDesc()
    	FIS_Inputs_tex_channelDesc = cudaCreateChannelDesc<float>();
    	//CUDA arrays are allocated using cudaMallocArray() and freed using cudaFreeArray()
    	CUDA_SAFE_CALL( cudaMallocArray( &CUDA_FIS_Inputs, &FIS_Inputs_tex_channelDesc, 1, 2 )); 

 	//cudaMallocArray() requires a format description created using cudaCreateChannelDesc()
 	FIS_RulesAndFuzzySets_tex_channelDesc = cudaCreateChannelDesc<float>();
 	//CUDA arrays are allocated using cudaMallocArray() and freed using cudaFreeArray()
    	CUDA_SAFE_CALL( cudaMallocArray( &CUDA_FIS_RulesAndFuzzySets, &FIS_RulesAndFuzzySets_tex_channelDesc, NUM_PARAMS * 2, R ));                 
               
    	//cudaMallocArray() requires a format description created using cudaCreateChannelDesc()
    	FIS_DiscreteConsequents_tex_channelDesc = cudaCreateChannelDesc<float>();
    	//CUDA arrays are allocated using cudaMallocArray() and freed using cudaFreeArray()
    	CUDA_SAFE_CALL( cudaMallocArray( &CUDA_FIS_DiscreteConsequents, &FIS_DiscreteConsequents_tex_channelDesc, SAMPLE_RATE, R ));                 

	////////////////////////////////////////////////////////////////////////////////////////               
               
	//Step that 'binds' the array memory to a 'texture' 
	
	//Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTextureToArray()
    	// set texture parameters (read the doc to set the texture sampling behavior)
    	FIS_Inputs_tex.addressMode[0] = cudaAddressModeClamp; 
    	FIS_Inputs_tex.addressMode[1] = cudaAddressModeClamp; 
    	FIS_Inputs_tex.filterMode = cudaFilterModePoint; 
    	FIS_Inputs_tex.normalized = false;    //normalized specifies whether texture coordinates are normalized or not.  If it is non-zero, all elements in the texture are addressed with texture coordinates in the range [0,1] rather than in the range [0,width-1] or [0,height-1], where width and height are the texture sizes

    	//Bind the array to the texture
    	//Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTextureToArray().
    	// cudaBindTexture: binds size bytes of the memory area pointed to by devPtr to the texture reference texRef
    	// cudaBindTextureToArray: binds the CUDA array cu_array to the texture reference texRef
    	CUDA_SAFE_CALL( cudaBindTextureToArray( FIS_Inputs_tex, CUDA_FIS_Inputs, FIS_Inputs_tex_channelDesc));
          
	//Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTextureToArray()
    	// set texture parameters (read the doc to set the texture sampling behavior)
    	FIS_RulesAndFuzzySets_tex.addressMode[0] = cudaAddressModeClamp; 
    	FIS_RulesAndFuzzySets_tex.addressMode[1] = cudaAddressModeClamp; 
    	FIS_RulesAndFuzzySets_tex.filterMode = cudaFilterModePoint; 
    	FIS_RulesAndFuzzySets_tex.normalized = false;    //normalized specifies whether texture coordinates are normalized or not.  If it is non-zero, all elements in the texture are addressed with texture coordinates in the range [0,1] rather than in the range [0,width-1] or [0,height-1], where width and height are the texture sizes

    	//Bind the array to the texture
    	//Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTextureToArray().
    	// cudaBindTexture: binds size bytes of the memory area pointed to by devPtr to the texture reference texRef
    	// cudaBindTextureToArray: binds the CUDA array cu_array to the texture reference texRef
    	CUDA_SAFE_CALL( cudaBindTextureToArray( FIS_RulesAndFuzzySets_tex, CUDA_FIS_RulesAndFuzzySets, FIS_RulesAndFuzzySets_tex_channelDesc));  
          
	//Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTextureToArray()
    	// set texture parameters (read the doc to set the texture sampling behavior)
    	FIS_DiscreteConsequents_tex.addressMode[0] = cudaAddressModeClamp; 
    	FIS_DiscreteConsequents_tex.addressMode[1] = cudaAddressModeClamp; 
    	FIS_DiscreteConsequents_tex.filterMode = cudaFilterModePoint; 
    	FIS_DiscreteConsequents_tex.normalized = false;    //normalized specifies whether texture coordinates are normalized or not.  If it is non-zero, all elements in the texture are addressed with texture coordinates in the range [0,1] rather than in the range [0,width-1] or [0,height-1], where width and height are the texture sizes

    	//Bind the array to the texture
    	//Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTextureToArray().
    	// cudaBindTexture: binds size bytes of the memory area pointed to by devPtr to the texture reference texRef
    	// cudaBindTextureToArray: binds the CUDA array cu_array to the texture reference texRef
	CUDA_SAFE_CALL( cudaBindTextureToArray( FIS_DiscreteConsequents_tex, CUDA_FIS_DiscreteConsequents, FIS_DiscreteConsequents_tex_channelDesc));  

	////////////////////////////////////////////////////////////////////////////////////////          
                 
}

The next function transfers the CPU memory to the GPU. We call cudaMemcpyToArray to copy the inputs, the rules and fuzzy sets, and the discrete consequent data. The inputs (FIS_Inputs) can still be continuously updated later; this is just initialization.

//Setup GPU memory
__host__ void GPUInitMemory(float *FIS_Inputs, float FIS_RulesAndFuzzySets[][2*NUM_PARAMS], int R)
{
 
    //Copy the inputs to the CUDA array
    CUDA_SAFE_CALL( cudaMemcpyToArray( CUDA_FIS_Inputs, 0, 0, (void*)FIS_Inputs, sizeof(float)*2, cudaMemcpyHostToDevice));
    //Copy the rules and fuzzy sets to the CUDA array
    CUDA_SAFE_CALL( cudaMemcpyToArray( CUDA_FIS_RulesAndFuzzySets, 0, 0, (void*)FIS_RulesAndFuzzySets, sizeof(float)*R*2*NUM_PARAMS, cudaMemcpyHostToDevice));
    //Copy the discrete consequents to the CUDA array
    CUDA_SAFE_CALL( cudaMemcpyToArray( CUDA_FIS_DiscreteConsequents, 0, 0, (void*)FIS_DiscreteConsequents, sizeof(float)*R*SAMPLE_RATE, cudaMemcpyHostToDevice));

}

This function initializes the rules and fuzzy sets and creates the discrete sampling of the respective fuzzy sets over the consequent domain. You will obviously populate your own data here, but what we have done is create 2 antecedent fuzzy sets ([-0.6 0 0 0.6] and [0.4 1 1 1.6]) and rotate between different consequents (a very simple FIS so you can verify your program's outputs). The two consequent domain fuzzy sets are [0.2 0.4 0.6 0.8] and [-0.8 -0.1 0.1 0.8]. The last part of the function is where the consequent domain is sampled for each rule and its respective fuzzy set.

//This sets up the rules and fuzzy sets
void CPUGetRulesAndFuzzySets()
{

	int i,j;

	//here are our 'hard coded' inputs (you can obviously read from file or etc later)
	for(i=0;i<Rval;i++){
		FIS_RulesAndFuzzySets[i][0] = -0.6;		//ant1
		FIS_RulesAndFuzzySets[i][1] = 0;
		FIS_RulesAndFuzzySets[i][2] = 0;
		FIS_RulesAndFuzzySets[i][3] = 0.6;
		FIS_RulesAndFuzzySets[i][4] = 0.4;		//ant2
		FIS_RulesAndFuzzySets[i][5] = 1;
		FIS_RulesAndFuzzySets[i][6] = 1;
		FIS_RulesAndFuzzySets[i][7] = 1.6;
		//Rotate between conseq's for rules	
		if( i % 2 == 0 ){
			FIS_ConsequentFuzzySets[i][0] = 0.2;		//consq
			FIS_ConsequentFuzzySets[i][1] = 0.4;
			FIS_ConsequentFuzzySets[i][2] = 0.6;
			FIS_ConsequentFuzzySets[i][3] = 0.8;
		}else{
			FIS_ConsequentFuzzySets[i][0] = -0.8;		//consq
			FIS_ConsequentFuzzySets[i][1] = -0.1;
			FIS_ConsequentFuzzySets[i][2] = 0.1;
			FIS_ConsequentFuzzySets[i][3] = 0.8;
		}
	}

	//here is where we make up the discrete consequent values
	for(i=0;i<Rval;i++){
		//Calc the value at each point
		for(j=0;j<SAMPLE_RATE;j++){
			FIS_DiscreteConsequents[i][j] = TRAPCALC((float)j/(float)SAMPLE_RATE,FIS_ConsequentFuzzySets[i][0],FIS_ConsequentFuzzySets[i][1],FIS_ConsequentFuzzySets[i][2],FIS_ConsequentFuzzySets[i][3]);
		}
	}
}

This function just sets the input to [0.2 0.5] for testing purposes. You can read in your data from file or etc instead of calling this function.

void CPUGetInputs()
{
	//these are our 'hard coded' inputs (you can obviously read from file or etc later)
	FIS_Inputs[0] = 0.2;
	FIS_Inputs[1] = 0.5;
}

This is main. We call our setup routines; then we would loop 1000 times (1000 rule-base firings), but I have commented the loop out so we see a single firing, again so you can check the output. Inside the commented-out for loop you can pass down your input data. We then call 3 functions that perform fuzzy logic on the GPU in CUDA; these are explained next and discussed in the paper. Lastly, we can print out the results and exit. I did not clean up system (CPU) or GPU memory, so be a good programmer and call the routines to deallocate it!

int main( int argc, char** argv ) 
{

	int i;

	//populateInputs();
	CPUInitMemory();
	CPUGetRulesAndFuzzySets();
	CPUGetInputs();
	InitCUDA(&Input1_Result1,&Input1_Result2,&Input2_Result1,&Input2_Result2,Rval);	
	GPUInitMemory(FIS_Inputs, FIS_RulesAndFuzzySets,Rval);	
	
	//Run 1000 test FIS rule-base firings
	//for(i=0;i<1000;i++){

	    	//Setup the inputs here if you like 
		//FIS_Inputs[0] = inputs[i][0];
		//FIS_Inputs[1] = inputs[i][1];
		//CUDA_SAFE_CALL( cudaMemcpyToArray( CUDA_FIS_Inputs, 0, 0, (void*)FIS_Inputs, sizeof(float)*2, cudaMemcpyHostToDevice));

		CPU_FuzzificationAndAntecedentCombine(Rval,Input1_Result1,Input1_Result2,Input2_Result1,Input2_Result2,&CPU_Input1_Result1,&CPU_Input1_Result2,&CPU_Input2_Result1,&CPU_Input2_Result2);	
		int flagresult = CPU_Reduction(Rval,Input1_Result1,Input1_Result2,Input2_Result1,Input2_Result2,&CPU_Input1_Result1,&CPU_Input1_Result2,&CPU_Input2_Result1,&CPU_Input2_Result2);
		CPU_Defuzzification(flagresult,Rval,Input1_Result1,Input1_Result2,Input2_Result1,Input2_Result2,&CPU_Input1_Result1,&CPU_Input1_Result2,&CPU_Input2_Result1,&CPU_Input2_Result2);

	//}

	//Result is in: CPU_CUDA_FIS_ImplicationResults[0]/CPU_CUDA_FIS_ImplicationResults[SAMPLE_RATE]

	//CUDA exit
	CUT_EXIT(argc, argv);	

}

This is the CPU-side function that sets up and calls the first GPU step. In this function, we set up a CUDA processing context of one thread per rule, up to 128 threads per block (the second dimension). If you want to, you can put Rval here, but the GPU code that follows is specialized to two antecedents. The reason I have not made the GPU code generic is that it is faster to unroll all of your code and optimize each step when you know the exact rule-base profile. If you like, you can put a for loop in the GPU code to make it dynamic/generic with respect to the number of antecedents (it takes two seconds to do!). We use one block in our grid. Next, we launch our kernel (Fuzzification_And_Antecedent) on the GPU, then we sync up the threads and we are done. In the GPU function, we sample the inputs, calculate their membership values, and then combine the rule antecedents using the min operator.

//Run fuzzification and combine the antecedents
__host__ void CPU_FuzzificationAndAntecedentCombine(int R, float *Gdata_input11, float *Gdata_input12, float *Gdata_input21, float *Gdata_input22, float **CPUResult11, float **CPUResult12, float **CPUResult21, float **CPUResult22)
{

	//The magic performance step

	int local_function_maximum_threads_per_block = 128; //MAX_NUMBER_OF_THREADS_PER_BLOCK;

	int num_of_threads;
	if( R < local_function_maximum_threads_per_block ) num_of_threads = R;
	else num_of_threads = local_function_maximum_threads_per_block;

	int num_of_blocks;
	num_of_blocks = ceil( (float)R / (float)local_function_maximum_threads_per_block );

	//How many threads per block
    	dim3 dimBlock(1, num_of_threads, 1);
    	//How many blocks per grid
    	dim3 dimGrid(1, num_of_blocks, 1);

    	Fuzzification_And_Antecedent<<< dimGrid, dimBlock, 0 >>>( Gdata_input11 );     
    	//CUDA_SAFE_CALL( cudaMemcpy( (*CPUResult11), Gdata_input11, sizeof(float)*R, cudaMemcpyDeviceToHost) );     

	//sync up the threads
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

}
//Run fuzzification and implication
__global__ void Fuzzification_And_Antecedent( float* g_outputdata ) 
{
	//Get our indexing information
	unsigned int rule_index = blockIdx.y*blockDim.y + threadIdx.y;

	//Sample from the input
	float x1 = tex2D(FIS_Inputs_tex, 0, 0);
	float x2 = tex2D(FIS_Inputs_tex, 0, 1);

	float a1 = tex2D(FIS_RulesAndFuzzySets_tex, 0, rule_index);
	float b1 = tex2D(FIS_RulesAndFuzzySets_tex, 1, rule_index);
	float c1 = tex2D(FIS_RulesAndFuzzySets_tex, 2, rule_index);
	float d1 = tex2D(FIS_RulesAndFuzzySets_tex, 3, rule_index);

	float a2 = tex2D(FIS_RulesAndFuzzySets_tex, 4, rule_index);
	float b2 = tex2D(FIS_RulesAndFuzzySets_tex, 5, rule_index);
	float c2 = tex2D(FIS_RulesAndFuzzySets_tex, 6, rule_index);
	float d2 = tex2D(FIS_RulesAndFuzzySets_tex, 7, rule_index);

	float fuzz1 = max( min(min((x1-a1) / (b1-a1),(d1-x1) / (d1-c1)),1.0) , 0.0 );
	float fuzz2 = max( min(min((x2-a2) / (b2-a2),(d2-x2) / (d2-c2)),1.0) , 0.0 );

	//write to global memory
	g_outputdata[rule_index] = min(fuzz1,fuzz2);
}

In this GPU function, we start the process of reduction. In reduction, we take a stream of elements and combine them using some operator, such as min or max. We use reduction here to reduce each consequent domain element over the set of rules (using max). We first read the data into shared memory; shared memory is faster than global, so each thread pulls its value into shared memory, and then we sync the threads. After this point, the reduction is performed in shared memory, and the thread with tid==0 writes out the results (in the last step, the maximum membership value and that value times the x consequent domain element, the two parts eventually needed to calculate the centroid).
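The shared-memory reduction pattern described above can be sketched sequentially on the CPU (an illustration, not part of the original code). Each pass halves the stride and takes the max with the partner element, which is exactly what the loop over s in the kernels below does with all threads running in lockstep:

```c
/* Sequential sketch of the tree reduction the ReductionLinear kernel performs
   within one block: each step halves the stride s and each "thread" tid < s
   takes the max of its element and its partner at tid + s.
   n must be a power of two, matching the kernel's assumption. */
static float max_reduce(float *data, int n)
{
    for (int s = n / 2; s > 0; s >>= 1)
        for (int tid = 0; tid < s; tid++)   /* all threads in lockstep */
            data[tid] = data[tid] > data[tid + s] ? data[tid] : data[tid + s];
    return data[0];                         /* what tid==0 writes out */
}
```

On the GPU, the inner loop runs in parallel, which is why the __syncthreads() barrier after each stride is required: no thread may read a partner value before it has been updated.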

//Perform vertical reduction 
__host__ int CPU_Reduction(int R, float *Gdata_input11, float *Gdata_input12, float *Gdata_input21, float *Gdata_input22, float **CPUResult11, float **CPUResult12, float **CPUResult21, float **CPUResult22)
{

	int thread_warp_ratio = 64;		//two warps; a single NV8800 warp is 32 threads
	int flag = 0;				//which input and output buffer to sample from		
						// if flag == 0, then the input data is in CUDA_FIS_ImplicationResults
						// if flag == 1, then the input data is in CUDA_FIS_ImplicationResults2

	//Used in reduction below
	int currentR = R;						
	int NB = (int) ceil( (float) currentR / (float) thread_warp_ratio );		

	while(1){

		//The last step
		if( NB == 1 ){

			//How many threads per block
			dim3 dimBlock2(1, currentR, 1);
			//How many blocks per grid
			dim3 dimGrid2(SAMPLE_RATE, 1, 1);

			//Run the GPU code
			if( flag == 0 ) LastStepReductionLinear<<< dimGrid2, dimBlock2, 0 >>>( CUDA_FIS_ImplicationResults , CUDA_FIS_ImplicationResults2 , Input1_Result1 );  
			else LastStepReductionLinear<<< dimGrid2, dimBlock2, 0 >>>( CUDA_FIS_ImplicationResults2 , CUDA_FIS_ImplicationResults , Input1_Result1 );  
			flag = !flag;

			//sync up the threads
			CUDA_SAFE_CALL( cudaThreadSynchronize() );

			break;

		}
 
		//How many threads per block
		dim3 dimBlock(1, thread_warp_ratio, 1);
		//How many blocks per grid
		dim3 dimGrid(SAMPLE_RATE, NB, 1);

		//Run the GPU code
		if( flag == 0 ) ReductionLinear<<< dimGrid, dimBlock, 0 >>>( CUDA_FIS_ImplicationResults , CUDA_FIS_ImplicationResults2 );  
		else ReductionLinear<<< dimGrid, dimBlock, 0 >>>( CUDA_FIS_ImplicationResults2 , CUDA_FIS_ImplicationResults );  
		flag = !flag;

		currentR = currentR / thread_warp_ratio;
		NB = (int) ceil( (float) currentR / (float) thread_warp_ratio );	

		//sync up the threads
		CUDA_SAFE_CALL( cudaThreadSynchronize() );

	}

	//Pull back the data
	//if( flag == 0 ) CUDA_SAFE_CALL( cudaMemcpy( CPU_CUDA_FIS_ImplicationResults, CUDA_FIS_ImplicationResults, sizeof(float)*R*SAMPLE_RATE, cudaMemcpyDeviceToHost) ); 
	//else CUDA_SAFE_CALL( cudaMemcpy( CPU_CUDA_FIS_ImplicationResults, CUDA_FIS_ImplicationResults2, sizeof(float)*R*SAMPLE_RATE, cudaMemcpyDeviceToHost) ); 

	return flag;

}
//Kernel for reduction 
__global__ void ReductionLinear( float* g_input , float* g_output ) 
{

	__shared__ float sdata[64];

	unsigned int tid = threadIdx.y;
	unsigned int i = blockIdx.y*blockDim.y + threadIdx.y;
	unsigned int j = blockIdx.x;

	//each thread pulls its value for this consequent sample into shared memory
	sdata[tid] = g_input[i*SAMPLE_RATE + j];
	__syncthreads();
	//Do shared memory reduction
	unsigned int s;
	for(s=blockDim.y/2;s>0;s>>=1){
		if( tid < s ) sdata[tid] = max(sdata[tid],sdata[tid+s]);
		__syncthreads();
	}
	if( tid==0 ) g_output[blockIdx.y*SAMPLE_RATE + j] = sdata[0];

}
//Kernel for reduction 
__global__ void LastStepReductionLinear( float* g_input , float* g_output , float* ant_vals ) 
{

	__shared__ float sdata[64];

	unsigned int tid = threadIdx.y;
	unsigned int i = blockIdx.y*blockDim.y + threadIdx.y;
	unsigned int j = blockIdx.x;

	sdata[tid] = min(tex2D(FIS_DiscreteConsequents_tex, j, i) , ant_vals[i] );
	__syncthreads();
	//Do shared memory reduction
	unsigned int s;
	for(s=blockDim.y/2;s>0;s>>=1){
		if( tid < s ) sdata[tid] = max(sdata[tid],sdata[tid+s]);
		__syncthreads();
	}
	if( tid==0 ){ 
		g_output[blockIdx.y + j] =  ((float)blockIdx.x/(float)SAMPLE_RATE) * sdata[0];
		g_output[blockIdx.y + SAMPLE_RATE + j] = sdata[0];	
	}

}

Here is where we reduce the mu and x*mu values (separately), i.e. the numerator and denominator halves of the centroid calculation.
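Once both sums come back from the GPU, the crisp output is their ratio. This CPU sketch (an illustration, not part of the original code) computes the same discrete centroid, sum(x*mu)/sum(mu), directly from a sampled membership function:

```c
#define SAMPLE_RATE 512

/* Discrete centroid over the consequent domain [0,1): the horizontal
   reductions sum the x*mu row and the mu row separately; the defuzzified
   output is sum(x*mu) / sum(mu). */
static float centroid(const float mu[SAMPLE_RATE])
{
    float num = 0.0f, den = 0.0f;
    for (int j = 0; j < SAMPLE_RATE; j++) {
        float x = (float)j / (float)SAMPLE_RATE;
        num += x * mu[j];
        den += mu[j];
    }
    return num / den;   /* caller must ensure den != 0 */
}
```

This is the same quotient that main reads back as CPU_CUDA_FIS_ImplicationResults[0] / CPU_CUDA_FIS_ImplicationResults[SAMPLE_RATE]. For a uniform membership function the centroid falls at the midpoint of the sampled domain, just under 0.5.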

//Perform horizontal reduction
__host__ void CPU_Defuzzification(int flagresult, int R, float *Gdata_input11, float *Gdata_input12, float *Gdata_input21, float *Gdata_input22, float **CPUResult11, float **CPUResult12, float **CPUResult21, float **CPUResult22)
{

	// if flagresult == 0, then the input data is in CUDA_FIS_ImplicationResults
	// if flagresult == 1, then the input data is in CUDA_FIS_ImplicationResults2

	int thread_warp_ratio = 64;		//two warps; a single NV8800 warp is 32 threads
	int flag = flagresult;			//which input and output buffer to sample from		
						// if flag == 0, then the input data is in CUDA_FIS_ImplicationResults
						// if flag == 1, then the input data is in CUDA_FIS_ImplicationResults2

	//Used in reduction below
	int currentR = SAMPLE_RATE;						
	int NB = (int) ceil( (float) currentR / (float) thread_warp_ratio );		

	while(1){

		//The last step
		if( NB == 1 ){

			//How many threads per block
			dim3 dimBlock2(currentR, 1, 1);
			//How many blocks per grid
			dim3 dimGrid2(1, 2, 1);

			//Run the GPU code
			if( flag == 0 ) HorizontalReductionLinear<<< dimGrid2, dimBlock2, 0 >>>( CUDA_FIS_ImplicationResults , CUDA_FIS_ImplicationResults2 );  
			else HorizontalReductionLinear<<< dimGrid2, dimBlock2, 0 >>>( CUDA_FIS_ImplicationResults2 , CUDA_FIS_ImplicationResults );  
			flag = !flag;

			//sync up the threads
			CUDA_SAFE_CALL( cudaThreadSynchronize() );

			break;

		}
 
		//How many threads per block
		dim3 dimBlock(thread_warp_ratio, 1, 1);
		//How many blocks per grid
		dim3 dimGrid(NB, 2, 1);

		//Run the GPU code
		if( flag == 0 ) HorizontalReductionLinear<<< dimGrid, dimBlock, 0 >>>( CUDA_FIS_ImplicationResults , CUDA_FIS_ImplicationResults2 );  
		else HorizontalReductionLinear<<< dimGrid, dimBlock, 0 >>>( CUDA_FIS_ImplicationResults2 , CUDA_FIS_ImplicationResults );  
		flag = !flag;

		currentR = currentR / thread_warp_ratio;
		NB = (int) ceil( (float) currentR / (float) thread_warp_ratio );	

		//sync up the threads
		CUDA_SAFE_CALL( cudaThreadSynchronize() );

	}

	//Pull back the data
	//if( flag == 0 ) CUDA_SAFE_CALL( cudaMemcpy( CPU_CUDA_FIS_ImplicationResults, CUDA_FIS_ImplicationResults, sizeof(float)*R*SAMPLE_RATE, cudaMemcpyDeviceToHost) ); 
	//else CUDA_SAFE_CALL( cudaMemcpy( CPU_CUDA_FIS_ImplicationResults, CUDA_FIS_ImplicationResults2, sizeof(float)*R*SAMPLE_RATE, cudaMemcpyDeviceToHost) ); 

	if( flag == 0 )CUDA_SAFE_CALL( cudaMemcpy( CPU_CUDA_FIS_ImplicationResults, CUDA_FIS_ImplicationResults, sizeof(float)*(SAMPLE_RATE+1), cudaMemcpyDeviceToHost) ); 
	else CUDA_SAFE_CALL( cudaMemcpy( CPU_CUDA_FIS_ImplicationResults, CUDA_FIS_ImplicationResults2, sizeof(float)*(SAMPLE_RATE+1), cudaMemcpyDeviceToHost) ); 

}
//Kernel for horizontal reduction
__global__ void HorizontalReductionLinear( float* g_input , float* g_output ) 
{

	__shared__ float sdata[64];

	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int j = blockIdx.y;

	//each thread pulls its value into shared memory
	sdata[tid] = g_input[i + j*SAMPLE_RATE];

	__syncthreads();

	//Do shared memory reduction
	unsigned int s;
	for(s=blockDim.x/2;s>0;s>>=1){
		if( tid < s ) sdata[tid] += sdata[tid+s];
		__syncthreads();
	}

	if( tid==0 ) g_output[blockIdx.x + j*SAMPLE_RATE] = sdata[0];

}





* DOWNLOAD the CUDA code file for this lesson.