0.4 Implementation with POSIX threads

This module contains details of using POSIX threads to attempt to optimize a multi-channel filterbank implementation.

Motivation for POSIX threads

The popular POSIX (Portable Operating System Interface) Thread API provides a multi-platform interface for creating multi-threaded applications on a variety of UNIX platforms. Multi-threaded applications can split their processing load across multiple cores in a computer. Most modern computers (including our test machine) contain CPUs with 2, 4, or 8 individual cores, all of which can process data concurrently. An application is easy to parallelize when significant portions of its data can be processed independently of the rest.

We recognized that p-threads presented a significant opportunity to improve the efficiency of our filter construction, because each channel's data stream is completely independent of every other channel's data. We therefore sought to split the processing load for the channels across multiple threads, all of which can run concurrently.

Implementation 1: compiler optimizations only

The first p-thread implementation is very similar to the previous single-threaded implementations. The parameters and data structures are exactly the same as before, and the filter coefficients were held constant for all channels (for speed). The primary difference is that each thread is responsible for filtering a portion of the channels. For example, with 256 channels and 4 p-threads, the processing load is divided equally as follows:

Thread 0: process channels 0 - 63
Thread 1: process channels 64 - 127
Thread 2: process channels 128 - 191
Thread 3: process channels 192 - 255
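
The division above can be sketched in C with the POSIX thread API. This is a minimal illustration, not the authors' code: the names worker_args, filter_channel, run_filterbank and MAX_THREADS are hypothetical, and filter_channel is an empty placeholder standing in for the real per-channel filter update, which is not shown in this module.

```c
#include <pthread.h>

#define MAX_THREADS 32  /* assumption: largest thread count used in the test runs */

/* Hypothetical description of one thread's share: a half-open channel range. */
typedef struct {
    int first_channel;  /* first channel this thread filters */
    int end_channel;    /* one past the last channel this thread filters */
    int channels_done;  /* written by the worker so the caller can check it */
} worker_args;

/* Placeholder for the real per-channel filter update (not shown in the text). */
static void filter_channel(int channel) { (void)channel; }

/* Thread entry point: filter every channel in this thread's range. */
static void *worker(void *p) {
    worker_args *a = p;
    for (int c = a->first_channel; c < a->end_channel; ++c) {
        filter_channel(c);
        a->channels_done++;
    }
    return NULL;
}

/* Split num_channels evenly over num_threads, run the workers, and join them.
   args must have room for num_threads entries.
   Returns 0 on success, or -1 if the thread count does not evenly divide the
   channel count, is out of range, or a thread could not be created. */
int run_filterbank(int num_channels, int num_threads, worker_args *args) {
    if (num_threads < 1 || num_threads > MAX_THREADS ||
        num_channels % num_threads != 0)
        return -1;

    pthread_t tid[MAX_THREADS];
    int per_thread = num_channels / num_threads;
    int started = 0;
    int status = 0;

    for (int t = 0; t < num_threads; ++t) {
        args[t].first_channel = t * per_thread;
        args[t].end_channel = (t + 1) * per_thread;
        args[t].channels_done = 0;
        if (pthread_create(&tid[t], NULL, worker, &args[t]) != 0) {
            status = -1;
            break;
        }
        ++started;
    }

    /* Wait for every thread that was actually started. */
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);

    return status;
}
```

Calling run_filterbank(256, 4, args) assigns channels 0 to 63 to thread 0 and channels 192 to 255 to thread 3, matching the list above. On many systems the program must be compiled and linked with the -pthread option.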

The code was designed to handle arbitrary numbers of p-threads with the pre-condition that the number of threads evenly divides the number of channels. We did our test runs using the standard configuration (256 channels, 600,000 data points, 100 cycles) with 1, 2, 4, 8, 16 and 32 p-threads. A control run with all operations performed on the main thread was also executed. Results for all runs are shown in Table 1.

Table 1: Runtimes for Implementation 1 (compiler optimizations only)
Number of Threads	Runtime (s)
Control (main thread)	48.075
1	64.302
2	96.755
4	123.931
8	67.629
16	141.329
32	121.134

Implementation 2: custom SSE3 intrinsic optimizations

The results we obtained from Implementation 1 appear promising, but note that all runs were slower than the fastest single-threaded implementation. We therefore decided to apply the SSE3 intrinsic operations that we developed in the previous section to the p-thread implementation. For SSE3 to work with p-threads, the pre-condition on the number of p-threads must be modified: SSE3 operates on batches of 4 floats at a time, so the number of channels that each thread processes must be divisible by 4. In essence:

\frac{\text{number of channels}}{\text{number of threads}} \cdot \frac{1}{4} \in \mathbb{Z}
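
As a quick check of this pre-condition, the sketch below tests whether a given thread count is usable. The function name thread_count_is_valid and the constant SSE_FLOATS_PER_VECTOR are hypothetical names chosen for illustration; they do not come from the original code.

```c
#include <stdbool.h>

/* A 128-bit SSE register holds 4 single-precision floats. */
#define SSE_FLOATS_PER_VECTOR 4

/* Returns true when the channels divide evenly among the threads and each
   thread's share is a whole number of 4-float SSE batches. */
bool thread_count_is_valid(int num_channels, int num_threads) {
    if (num_channels <= 0 || num_threads <= 0)
        return false;
    if (num_channels % num_threads != 0)
        return false;
    return (num_channels / num_threads) % SSE_FLOATS_PER_VECTOR == 0;
}
```

For 256 channels, every power-of-two thread count up to 64 passes this check, while 128 threads fails because each thread would receive only 2 channels.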

With the same standard run configuration, this pre-condition still supported test runs with 1, 2, 4, 8, 16 and 32 p-threads, along with a control run with execution on the main thread. The results of all runs are shown in Table 2.

Table 2: Runtimes for Implementation 2 (custom SSE3 intrinsics)
Number of Threads	Runtime (s)
Control (main thread)	48.333
1	50.109
2	88.632
4	138.090
8	62.481
16	103.901
32	78.219

Analysis of results

Note from the figure that the p-thread runs take much longer for low (between 1 and 4) and high (greater than 8) numbers of p-threads. Marginally acceptable performance occurs with 8 p-threads on our benchmark computer, which yielded run times of 67.6 seconds using compiler optimizations and 62.5 seconds using our SSE3 code. By comparison, the single-threaded implementation runs in about 48 seconds.

This behavior is not intuitive: despite higher processor utilization, the multi-threaded runs take longer than their single-threaded counterparts. After some thought, we concluded that three effects were at work:

Cache Missing: Each CPU core contains a cache, which gives the processor faster access to data. When a piece of data is requested, it is pulled from memory into cache in a chunk called a “cache line”. Subsequent accesses to memory near the original access are served from the cache, which is much faster. A data request that cannot be found in cache is called a cache “miss,” and a request that is found in cache is called a cache “hit.” When there are very few p-threads, each thread operates on a large portion of memory. In addition, each channel's intermediate variables (held in the code in arrays named w1, w2, w3 and w4) are stored in different arrays, which spreads them out in memory. Both factors increase the number of cache misses.
Cache Poisoning: Each thread works on its own data, so one thread never overwrites another thread's data. However, one thread may operate on data that lies close in memory to another thread's data, and each thread (running on its own core) uses that core's cache. When a thread updates memory that also resides in another core's cache, the other cache is notified and marks its copy as invalid, so the other thread's next read of that data cannot be served from its cache. This effect is commonly called false sharing. We believe that with large numbers of p-threads (greater than 8), the threads operate on smaller chunks of memory that are closer together, resulting in more cache poisonings.
CPU Utilization and P-thread Overhead: The benchmark computer was equipped with a quad-core processor. This would normally imply that 4 p-threads is optimal, but advances in CPU architecture, which are beyond the scope of this research project, mean that 2 p-threads per core run more effectively. Also, as the number of p-threads increases, the overhead required to manage them grows and their actual effectiveness diminishes.
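
To make the cache-missing point concrete, the sketch below counts how many cache lines one channel's four intermediate variables occupy under two memory layouts. It rests on several assumptions that are not stated in the text: the variables are single-precision floats, the cache line is 64 bytes, and each structure starts on a cache-line boundary. The per-channel record (channel_state) is a hypothetical alternative layout shown only for comparison, not necessarily the reorganization the authors adopted.

```c
#include <stddef.h>

#define NUM_CHANNELS 256
#define CACHE_LINE_BYTES 64  /* assumption: typical x86 cache line size */

/* Layout described in the text: one array per intermediate variable. */
struct separate_arrays {
    float w1[NUM_CHANNELS];
    float w2[NUM_CHANNELS];
    float w3[NUM_CHANNELS];
    float w4[NUM_CHANNELS];
};

/* Hypothetical alternative: one 16-byte record per channel. */
struct channel_state { float w1, w2, w3, w4; };
struct interleaved { struct channel_state ch[NUM_CHANNELS]; };

/* Count the distinct cache lines covered by four byte offsets, measured
   from a base address assumed to be cache-line aligned. */
static int distinct_lines(const size_t off[4]) {
    int count = 0;
    for (int i = 0; i < 4; ++i) {
        int seen = 0;
        for (int j = 0; j < i; ++j)
            if (off[i] / CACHE_LINE_BYTES == off[j] / CACHE_LINE_BYTES)
                seen = 1;
        if (!seen)
            ++count;
    }
    return count;
}

/* Cache lines holding w1..w4 of channel c in the separate-array layout. */
int lines_touched_separate(int c) {
    size_t off[4] = {
        offsetof(struct separate_arrays, w1) + c * sizeof(float),
        offsetof(struct separate_arrays, w2) + c * sizeof(float),
        offsetof(struct separate_arrays, w3) + c * sizeof(float),
        offsetof(struct separate_arrays, w4) + c * sizeof(float)
    };
    return distinct_lines(off);
}

/* Cache lines holding w1..w4 of channel c in the per-channel layout. */
int lines_touched_interleaved(int c) {
    size_t base = offsetof(struct interleaved, ch) +
                  c * sizeof(struct channel_state);
    size_t off[4] = {
        base + offsetof(struct channel_state, w1),
        base + offsetof(struct channel_state, w2),
        base + offsetof(struct channel_state, w3),
        base + offsetof(struct channel_state, w4)
    };
    return distinct_lines(off);
}
```

Under these assumptions, updating one channel touches four different cache lines with separate arrays, but only one when the four values sit next to each other.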

Taking these issues into account, we focused our efforts on reorganizing the data structures used by the filter bank.

Source: OpenStax, Efficient real-time filter design for recording multichannel neural activity. OpenStax CNX. Dec 11, 2012 Download for free at http://cnx.org/content/col11461/1.1
