Hi everyone,
I conducted a few experiments with a workload to compare the following parameters with and without this patchset:

1. The performance (throughput) of the workload.
2. The sum of the wait time to run of the processes queued on each cpu, i.e. the cumulative latency.
3. The number of migrations of tasks between cpus.
The observations and inferences are given below:
_Experimental setup_

1. The workload is at the end of the mail. Every run of the workload was for 10s.
2. A different number of long running and short running threads was run each time.
3. The setup was on a two socket pre-Nehalem machine, but one socket had all its cpus offlined. Thus only one socket was active throughout the experiment. The socket consisted of 4 cores.
4. The statistics below have been collected from /proc/schedstat, except throughput, which is output by the workload.
   - Latency has been observed from the eighth field in the cpu statistics in /proc/schedstat:
       cpu<N> 1 2 3 4 5 6 7 "8" 9
   - The number of migrations has been calculated by summing up the #pulls during the idle, busy and newly_idle states of all the cpus. This is also given by /proc/schedstat.
5. FieldA -> #short-running-tasks [for every 10ms, sleep for 9ms and work for 1ms], i.e. a 10% duty cycle task.
   FieldB -> #long-running-tasks
   Field1 -> Throughput with patch (records/s read)
   Field2 -> Throughput without patch (records/s read)
   Field3 -> #Migrations with patch
   Field4 -> #Migrations without patch
   Field5 -> Latency with patch
   Field6 -> Latency without patch
A   B   1          2          3     4     5      6
-------------------------------------------------------
5   5   49,93,368  48,68,351  108   28    22s    18.3s
4   2   34,37,669  34,37,547  58    50    0.6s   0.17s
16  0   38,66,597  38,74,580  1151  1014  1.88s  1.65s
_Inferences_
1. Clearly an increase in the number of pulls can be seen with this patch, and this has resulted in an increase in the latency. This *should have* resulted in a decrease in throughput, but in the first two cases this is not reflected. This could be due to some error in the benchmark itself or in the way I am calculating the throughput. Keeping this issue aside, I focus on the effect on #pulls and latency.
2. On integrating PJT's metric with the load balancer, #Migrations increase due to the following reason, which I figured out by going through the traces.
       Task1  Task3               Task2  Task4
       ------------               ------------
          Group1                     Group2

Case1: Load_as_per_pjt        1028          1121
Case2: Load_without_pjt       2048          2048

                        Fig1.
During load balancing:
Case1: Group2 is overloaded, so one of its tasks is moved to Group1.
Case2: Group1 and Group2 are equally loaded, hence there are no migrations.
This is observed so many times that it is no wonder the #migrations have increased with this patch. Here Group refers to a sched_group.
3. The next obvious step was to see whether so many migrations with my patch are prudent or not. The latency numbers reflect that they are not.
4. As I said earlier, I keep throughput out of these inferences because it distracts us from something that is starkly clear: *Migrations incurred due to PJT's metric are not affecting the tasks positively.*
5. The above is my first observation. This does not however say that using PJT's metric with the load balancer is a bad idea. This could mean many things, out of which the correct one has to be figured out. Among them I list a few:
a) Simply replacing the existing metric used by the load balancer with PJT's metric might not really derive the benefit that PJT's metric has to offer.
b) I have not been able to figure out what kind of workloads actually benefit from the way we have applied PJT's metric. Maybe we are using a workload which is adversely affected by it.
6. My next steps, in my opinion, will be to resolve the following issues, in decreasing order of priority:
a) Run some other benchmark, like kernbench, and find out whether the throughput correctly reflects the increase in latency. If it does, then I will need to find out why the current benchmark was behaving weirdly; else I will need to go through the traces to figure out this issue.
b) If I find that the throughput is consistent with the latency, then we need to modify the strictness (the granularity of time at which the load is getting updated) with which PJT's metric calculates load, or use it in some other way in load balancing.
Looking forward to your feedback on this :)
--------------------------BEGIN WORKLOAD---------------------------------
/*
 * test.c - Two instances of this program are run: one instance where the
 * sleep time is 0 and another instance which sleeps at regular intervals
 * of time. This is done to create both long running and short running
 * tasks on the cpu.
 *
 * Multiple threads are created of each instance. The threads request a
 * memory chunk, write into it and then free it. This is done throughout
 * the period of the run.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "malloc.h"
/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;
/* Fixed entities */
typedef size_t mem_slot_t;	/* 8 bytes on 64-bit */
static unsigned int slot_size = sizeof(mem_slot_t);
/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int write_to_mem(void)
{
	int i, j;
	mem_slot_t *scratch_pad, *temp;
	mem_slot_t *end;

	mem_chunk_size = slot_size * 256;
	sleep_at = 2800;	/* sleep every 2800 records for short runs; else sleep_at = 0 */
	sleep_interval = 9000;	/* sleep for 9 ms */

	for (i = 0; start == 1; i++) {
		/* Ask for a memory chunk */
		scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
		if (scratch_pad == NULL) {
			fprintf(stderr, "Could not allocate memory\n");
			exit(1);
		}
		end = scratch_pad + (mem_chunk_size / slot_size);

		/* Write into this chunk */
		for (temp = scratch_pad, j = 0; temp < end; temp++, j++)
			*temp = (mem_slot_t)j;

		/* Free this chunk */
		free(scratch_pad);

		/* Decide the duty cycle; currently 10 ms */
		if (sleep_at && !(i % sleep_at))
			usleep(sleep_interval);
	}
	return i;
}
static void *thread_run(void *arg)
{
	unsigned int records_local;

	/* Wait for the start signal */
	while (start == 0)
		;

	records_local = write_to_mem();

	pthread_mutex_lock(&records_count_lock);
	records_read += records_local;
	pthread_mutex_unlock(&records_count_lock);

	return NULL;
}
static void start_threads(void)
{
	double diff_time;
	unsigned int i;
	int err;

	threads = 8;
	seconds = 10;

	pthread_t thread_array[threads];

	for (i = 0; i < threads; i++) {
		err = pthread_create(&thread_array[i], NULL, thread_run, NULL);
		if (err) {
			fprintf(stderr, "Error creating thread %d\n", i);
			exit(1);
		}
	}
	start_time = time(NULL);
	start = 1;
	sleep(seconds);
	start = 0;
	diff_time = difftime(time(NULL), start_time);

	for (i = 0; i < threads; i++) {
		err = pthread_join(thread_array[i], NULL);
		if (err) {
			fprintf(stderr, "Error joining thread %d\n", i);
			exit(1);
		}
	}
	printf("%u records/s\n",
	       (unsigned int)(((double)records_read) / diff_time));
}

int main(void)
{
	start_threads();
	return 0;
}
------------------------END WORKLOAD------------------------------------

Regards
Preeti U Murthy