docs: move other kAPI documents to core-api

There are a number of stray documents that describe aspects of
the core API. Move them into the core-api directory, adding them to
the core-api/index.rst file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/86d979ed183adb76af93a92f20189bccf97f0055.1592918949.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
@@ -0,0 +1,220 @@
==========================================================
How to access I/O mapped memory from within device drivers
==========================================================
:Author: Linus

.. warning::

   The virt_to_bus() and bus_to_virt() functions have been
   superseded by the functionality provided by the PCI DMA interface
   (see :doc:`/core-api/dma-api-howto`). They continue
   to be documented below for historical purposes, but new code
   must not use them. --davidm 00/12/12

::

  [ This is a mail message in response to a query on IO mapping, thus the
    strange format for a "document" ]

The AHA-1542 is a bus-master device, and your patch makes the driver give the
controller the physical address of the buffers, which is correct on x86
(because all bus master devices see the physical memory mappings directly).

However, on many setups, there are actually **three** different ways of looking
at memory addresses, and in this case we actually want the third, the
so-called "bus address".

Essentially, the three ways of addressing memory are (this is "real memory",
that is, normal RAM--see later about other details):

 - CPU untranslated. This is the "physical" address. Physical address
   0 is what the CPU sees when it drives zeroes on the memory bus.

 - CPU translated address. This is the "virtual" address, and is
   completely internal to the CPU itself with the CPU doing the appropriate
   translations into "CPU untranslated".

 - bus address. This is the address of memory as seen by OTHER devices,
   not the CPU. Now, in theory there could be many different bus
   addresses, with each device seeing memory in some device-specific way, but
   happily most hardware designers aren't actually actively trying to make
   things any more complex than necessary, so you can assume that all
   external hardware sees the memory the same way.

Now, on normal PCs the bus address is exactly the same as the physical
address, and things are very simple indeed. However, they are that simple
because the memory and the devices share the same address space, and that is
not generally necessarily true on other PCI/ISA setups.

Now, just as an example, on the PReP (PowerPC Reference Platform), the
CPU sees a memory map something like this (this is from memory)::

  0-2 GB     "real memory"
  2 GB-3 GB  "system IO" (inb/out and similar accesses on x86)
  3 GB-4 GB  "IO memory" (shared memory over the IO bus)

Now, that looks simple enough. However, when you look at the same thing from
the viewpoint of the devices, you have the reverse, and the physical memory
address 0 actually shows up as address 2 GB for any IO master.

So when the CPU wants any bus master to write to physical memory 0, it
has to give the master address 0x80000000 as the memory address.

So, for example, depending on how the kernel is actually mapped on the
PPC, you can end up with a setup like this::

  physical address:  0
  virtual address:   0xC0000000
  bus address:       0x80000000

where all the addresses actually point to the same thing. It's just seen
through different translations..

Similarly, on the Alpha, the normal translation is::

  physical address:  0
  virtual address:   0xfffffc0000000000
  bus address:       0x40000000

(but there are also Alphas where the physical address and the bus address
are the same).

Anyway, the way to look up all these translations, you do::

  #include <asm/io.h>

  phys_addr = virt_to_phys(virt_addr);
  virt_addr = phys_to_virt(phys_addr);
  bus_addr = virt_to_bus(virt_addr);
  virt_addr = bus_to_virt(bus_addr);

Now, when do you need these?

You want the **virtual** address when you are actually going to access that
pointer from the kernel. So you can have something like this::

  /*
   * this is the hardware "mailbox" we use to communicate with
   * the controller. The controller sees this directly.
   */
  struct mailbox {
      __u32 status;
      __u32 bufstart;
      __u32 buflen;
      ..
  } mbox;

  unsigned char * retbuffer;

  /* get the address from the controller */
  retbuffer = bus_to_virt(mbox.bufstart);
  switch (retbuffer[0]) {
      case STATUS_OK:
          ...

on the other hand, you want the bus address when you have a buffer that
you want to give to the controller::

  /* ask the controller to read the sense status into "sense_buffer" */
  mbox.bufstart = virt_to_bus(&sense_buffer);
  mbox.buflen = sizeof(sense_buffer);
  mbox.status = 0;
  notify_controller(&mbox);

And you generally **never** want to use the physical address, because you can't
use that from the CPU (the CPU only uses translated virtual addresses), and
you can't use it from the bus master.

So why do we care about the physical address at all? We do need the physical
address in some cases, it's just not very often in normal code. The physical
address is needed if you use memory mappings, for example, because the
"remap_pfn_range()" mm function wants the physical address of the memory to
be remapped as measured in units of pages, a.k.a. the pfn (the memory
management layer doesn't know about devices outside the CPU, so it
shouldn't need to know about "bus addresses" etc).
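
As a sketch (not part of the original mail), an mmap() handler in a driver
might use the physical address like this; the buffer and function names are
hypothetical::

  static void *mydrv_buffer;  /* hypothetical, allocated elsewhere */

  /* hypothetical sketch: expose a driver buffer to user space */
  static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
  {
      unsigned long size = vma->vm_end - vma->vm_start;
      phys_addr_t phys = virt_to_phys(mydrv_buffer);

      /* remap_pfn_range() wants a page frame number (pfn), not a
       * virtual or bus address */
      return remap_pfn_range(vma, vma->vm_start, phys >> PAGE_SHIFT,
                             size, vma->vm_page_prot);
  }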

.. note::

   The above is only one part of the whole equation. The above
   only talks about "real memory", that is, CPU memory (RAM).

There is a completely different type of memory too, and that's the "shared
memory" on the PCI or ISA bus. That's generally not RAM (although in the case
of a video graphics card it can be normal DRAM that is just used for a frame
buffer), but can be things like a packet buffer in a network card etc.

This memory is called "PCI memory" or "shared memory" or "IO memory" or
whatever, and there is only one way to access it: the readb/writeb and
related functions. You should never take the address of such memory, because
there is really nothing you can do with such an address: it's not
conceptually in the same memory space as "real memory" at all, so you cannot
just dereference a pointer. (Sadly, on x86 it **is** in the same memory space,
so on x86 it actually works to just dereference a pointer, but it's not
portable).

For such memory, you can do things like:

 - reading::

    /*
     * read first 32 bits from ISA memory at 0xC0000, aka
     * C000:0000 in DOS terms
     */
    unsigned int signature = isa_readl(0xC0000);

 - remapping and writing::

    /*
     * remap framebuffer PCI memory area at 0xFC000000,
     * size 1MB, so that we can access it: We can directly
     * access only the 640k-1MB area, so anything else
     * has to be remapped.
     */
    void __iomem *baseptr = ioremap(0xFC000000, 1024*1024);

    /* write a 'A' to the offset 10 of the area */
    writeb('A', baseptr+10);

    /* unmap when we unload the driver */
    iounmap(baseptr);

 - copying and clearing::

    /* get the 6-byte Ethernet address at ISA address E000:0040 */
    memcpy_fromio(kernel_buffer, 0xE0040, 6);

    /* write a packet to the driver */
    memcpy_toio(0xE1000, skb->data, skb->len);

    /* clear the frame buffer */
    memset_io(0xA0000, 0, 0x10000);

OK, that just about covers the basics of accessing IO portably. Questions?
Comments? You may think that all the above is overly complex, but one day you
might find yourself with a 500 MHz Alpha in front of you, and then you'll be
happy that your driver works ;)

Note that kernel versions 2.0.x (and earlier) mistakenly called the
ioremap() function "vremap()". ioremap() is the proper name, but I
didn't think straight when I wrote it originally. People who have to
support both can do something like::

  /* support old naming silliness */
  #if LINUX_VERSION_CODE < 0x020100
  #define ioremap vremap
  #define iounmap vfree
  #endif

at the top of their source files, and then they can use the right names
even on 2.0.x systems.

And the above sounds worse than it really is. Most real drivers really
don't do all that complex things (or rather: the complexity is not so
much in the actual IO accesses as in error handling and timeouts etc).
It's generally not hard to fix drivers, and in many cases the code
actually looks better afterwards::

  unsigned long signature = *(unsigned int *) 0xC0000;
      vs
  unsigned long signature = readl(0xC0000);

I think the second version actually is more readable, no?


@@ -39,6 +39,8 @@ Library functionality that is used throughout the kernel.
    rbtree
    generic-radix-tree
    packing
+   bus-virt-phys-mapping
+   this_cpu_ops
    timekeeping
    errseq
@@ -82,6 +84,7 @@ more memory-management documentation in :doc:`/vm/index`.
    :maxdepth: 1

    memory-allocation
    unaligned-memory-access
+   dma-api
    dma-api-howto
    dma-attributes


@@ -0,0 +1,339 @@
===================
this_cpu operations
===================
:Author: Christoph Lameter, August 4th, 2014
:Author: Pranith Kumar, Aug 2nd, 2014

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor. This is
done through the use of segment registers (or a dedicated register where
the cpu permanently stores the beginning of the per cpu area for a
specific processor).

this_cpu operations add a per cpu variable offset to the processor
specific per cpu base and encode that operation in the instruction
operating on the per cpu variable.

This means that there are no atomicity issues between the calculation of
the offset and the operation on the data. Therefore it is not
necessary to disable preemption or interrupts to ensure that the
processor is not changed between the calculation of the address and
the operation on the data.

Read-modify-write operations are of particular interest. Frequently
processors have special lower latency instructions that can operate
without the typical synchronization overhead, but still provide some
sort of relaxed atomicity guarantees. The x86, for example, can execute
RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the
lock prefix and the associated latency penalty.

Access to the variable without the lock prefix is not synchronized but
synchronization is not necessary since we are dealing with per cpu
data specific to the currently executing processor. Only the current
processor should be accessing that variable and therefore there are no
concurrency issues with other processors in the system.

Please note that accesses by remote processors to a per cpu area are
exceptional situations and may impact performance and/or correctness
(remote write operations) of local RMW operations via this_cpu_*.

The main use of the this_cpu operations has been to optimize counter
operations.

The following this_cpu() operations with implied preemption protection
are defined. These operations can be used without worrying about
preemption and interrupts::

  this_cpu_read(pcp)
  this_cpu_write(pcp, val)
  this_cpu_add(pcp, val)
  this_cpu_and(pcp, val)
  this_cpu_or(pcp, val)
  this_cpu_add_return(pcp, val)
  this_cpu_xchg(pcp, nval)
  this_cpu_cmpxchg(pcp, oval, nval)
  this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
  this_cpu_sub(pcp, val)
  this_cpu_inc(pcp)
  this_cpu_dec(pcp)
  this_cpu_sub_return(pcp, val)
  this_cpu_inc_return(pcp)
  this_cpu_dec_return(pcp)

Inner working of this_cpu operations
------------------------------------

On x86 the fs: or the gs: segment registers contain the base of the
per cpu area. It is then possible to simply use the segment override
to relocate a per cpu relative address to the proper per cpu area for
the processor. So the relocation to the per cpu base is encoded in the
instruction via a segment register prefix.

For example::

  DEFINE_PER_CPU(int, x);
  int z;

  z = this_cpu_read(x);

results in a single instruction::

  mov ax, gs:[x]

instead of a sequence of address calculation and then a fetch from that
address, which is what the plain per cpu operations require. Before
this_cpu_ops, such a sequence also required preempt disable/enable to
prevent the kernel from moving the thread to a different processor
while the calculation is performed.

Consider the following this_cpu operation::

  this_cpu_inc(x)

The above results in the following single instruction (no lock prefix!)::

  inc gs:[x]

instead of the following operations required if there is no segment
register::

  int *y;
  int cpu;

  cpu = get_cpu();
  y = per_cpu_ptr(&x, cpu);
  (*y)++;
  put_cpu();

Note that these operations can only be used on per cpu data that is
reserved for a specific processor. Without disabling preemption in the
surrounding code, this_cpu_inc() will only guarantee that one of the
per cpu counters is correctly incremented. However, there is no
guarantee that the OS will not move the process directly before or
after the this_cpu instruction is executed. In general this means that
the values of the individual counters for each processor are
meaningless. The sum of all the per cpu counters is the only value
that is of interest.

Per cpu variables are used for performance reasons. Bouncing cache
lines can be avoided if multiple processors concurrently go through
the same code paths. Since each processor has its own per cpu
variables no concurrent cache line updates take place. The price that
has to be paid for this optimization is the need to add up the per cpu
counters when the value of a counter is needed.
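
As a minimal sketch (not part of the original text; the variable and
function names are made up), such a counter could look like this::

  DEFINE_PER_CPU(unsigned long, my_counter);

  /* fast path: no lock and no preemption games needed */
  static void count_event(void)
  {
      this_cpu_inc(my_counter);
  }

  /* slow path: add up the per cpu counters for a total */
  static unsigned long total_events(void)
  {
      unsigned long sum = 0;
      int cpu;

      for_each_possible_cpu(cpu)
          sum += per_cpu(my_counter, cpu);
      return sum;
  }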

Special operations
------------------

::

  y = this_cpu_ptr(&x)

Takes the offset of a per cpu variable (&x !) and returns the address
of the per cpu variable that belongs to the currently executing
processor. this_cpu_ptr avoids multiple steps that the common
get_cpu/put_cpu sequence requires. No processor number is
available. Instead, the offset of the local per cpu area is simply
added to the per cpu offset.
Note that this operation is usually used in a code segment when
preemption has been disabled. The pointer is then used to
access local per cpu data in a critical section. When preemption
is re-enabled this pointer is usually no longer useful since it may
no longer point to per cpu data of the current processor.
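
An illustrative sketch of that pattern (the struct and variable names are
hypothetical, not from the original text)::

  struct my_data {
      int count;
  };
  DEFINE_PER_CPU(struct my_data, my_data_var);

  static void update_local(void)
  {
      struct my_data *d;

      preempt_disable();
      d = this_cpu_ptr(&my_data_var);
      d->count++;  /* safe: no migration until preempt_enable() */
      preempt_enable();
      /* d may now refer to another cpu's data; do not reuse it */
  }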

Per cpu variables and offsets
-----------------------------

Per cpu variables have *offsets* to the beginning of the per cpu
area. They do not have addresses although they look like that in the
code. Offsets cannot be directly dereferenced. The offset must be
added to a base pointer of a per cpu area of a processor in order to
form a valid address.
Therefore the use of x or &x outside of the context of per cpu
operations is invalid and will generally be treated like a NULL
pointer dereference.

::

  DEFINE_PER_CPU(int, x);

In the context of per cpu operations the above implies that x is a per
cpu variable. Most this_cpu operations take a cpu variable.

::

  int __percpu *p = &x;

&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr()
takes the offset of a per cpu variable which makes this look a bit
strange.

Operations on a field of a per cpu structure
--------------------------------------------

Let's say we have a percpu structure::

  struct s {
      int n, m;
  };

  DEFINE_PER_CPU(struct s, p);

Operations on these fields are straightforward::

  this_cpu_inc(p.m)

  z = this_cpu_cmpxchg(p.m, 0, 1);

If we have an offset to struct s::

  struct s __percpu *ps = &p;

  this_cpu_dec(ps->m);

  z = this_cpu_inc_return(ps->n);

The calculation of the pointer may require the use of this_cpu_ptr()
if we do not make use of this_cpu ops later to manipulate fields::

  struct s *pp;

  pp = this_cpu_ptr(&p);

  pp->m--;

  z = pp->n++;

Variants of this_cpu ops
------------------------

this_cpu ops are interrupt safe. Some architectures do not support
these per cpu local operations. In that case the operation must be
replaced by code that disables interrupts, then does the operations
that are guaranteed to be atomic and then re-enables interrupts. Doing
so is expensive. If there are other reasons why the scheduler cannot
change the processor we are executing on, then there is no reason to
disable interrupts. For that purpose the following __this_cpu operations
are provided.

These operations have no guarantee against concurrent interrupts or
preemption. If a per cpu variable is not used in an interrupt context
and the scheduler cannot preempt, then they are safe. If any interrupts
still occur while an operation is in progress and if the interrupt too
modifies the variable, then RMW actions can not be guaranteed to be
safe::

  __this_cpu_read(pcp)
  __this_cpu_write(pcp, val)
  __this_cpu_add(pcp, val)
  __this_cpu_and(pcp, val)
  __this_cpu_or(pcp, val)
  __this_cpu_add_return(pcp, val)
  __this_cpu_xchg(pcp, nval)
  __this_cpu_cmpxchg(pcp, oval, nval)
  __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
  __this_cpu_sub(pcp, val)
  __this_cpu_inc(pcp)
  __this_cpu_dec(pcp)
  __this_cpu_sub_return(pcp, val)
  __this_cpu_inc_return(pcp)
  __this_cpu_dec_return(pcp)

An operation like __this_cpu_inc(x) will increment x and will not fall
back to code that disables interrupts on platforms that cannot accomplish
atomicity through address relocation and a Read-Modify-Write operation
in the same instruction.
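
A sketch of the intended usage (with a hypothetical variable); note that
the protection comes from the surrounding code, not from the operation
itself::

  DEFINE_PER_CPU(int, nesting);

  static void enter_section(void)
  {
      local_irq_disable();
      /* neither an interrupt nor a migration can happen here,
       * so the cheaper __this_cpu form is sufficient */
      __this_cpu_inc(nesting);
      local_irq_enable();
  }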

&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
--------------------------------------------

The first operation takes the offset and forms an address and then
adds the offset of the n field. This may result in two add
instructions emitted by the compiler.

The second one first adds the two offsets and then does the
relocation. IMHO the second form looks cleaner and is easier to get
right with parentheses. The second form is also consistent with the way
this_cpu_read() and friends are used.
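
Spelled out (reusing struct s and the per cpu variable p from above; this
sketch is not part of the original text, and real uses would disable
preemption around it)::

  static void example(void)
  {
      struct s __percpu *pp = &p;
      int *n1, *n2;

      /* first form: form an address, then add the field offset */
      n1 = &this_cpu_ptr(pp)->n;

      /* second form: add the offsets first, then relocate */
      n2 = this_cpu_ptr(&pp->n);
  }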

Remote access to per cpu data
------------------------------

Per cpu data structures are designed to be used by one cpu exclusively.
If you use the variables as intended, this_cpu_ops() are guaranteed to
be "atomic" as no other CPU has access to these data structures.

There are special cases where you might need to access per cpu data
structures remotely. It is usually safe to do a remote read access
and that is frequently done to summarize counters. Remote write access
is something which could be problematic because this_cpu ops do not
have lock semantics. A remote write may interfere with a this_cpu
RMW operation.

Remote write accesses to percpu data structures are highly discouraged
unless absolutely necessary. Please consider using an IPI to wake up
the remote CPU and perform the update to its per cpu area.

To access per-cpu data structures remotely, typically the per_cpu_ptr()
function is used::

  DEFINE_PER_CPU(struct data, datap);

  struct data *p = per_cpu_ptr(&datap, cpu);

This makes it explicit that we are getting ready to access a percpu
area remotely.

You can also do the following to convert the datap offset to an address::

  struct data *p = this_cpu_ptr(&datap);

but passing pointers calculated via this_cpu_ptr to other cpus is
unusual and should be avoided.

Remote access is typically only for reading the status of another cpu's
per cpu data. Write accesses can cause unique problems due to the
relaxed synchronization requirements for this_cpu operations.

One example that illustrates some concerns with write operations is
the following scenario that occurs because two per cpu variables
share a cache-line but the relaxed synchronization is applied to
only one process updating the cache-line.

Consider the following example::

  struct test {
      atomic_t a;
      int b;
  };

  DEFINE_PER_CPU(struct test, onecacheline);

There is some concern about what would happen if the field 'a' is updated
remotely from one processor and the local processor would use this_cpu ops
to update field b. Care should be taken that such simultaneous accesses to
data within the same cache line are avoided. Also costly synchronization
may be necessary. IPIs are generally recommended in such scenarios instead
of a remote write to the per cpu area of another processor.

Even in cases where the remote writes are rare, please bear in
mind that a remote write will evict the cache line from the processor
that most likely will access it. If the processor wakes up and finds a
missing local cache line of a per cpu area, its performance and hence
the wake up times will be affected.
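
As a sketch of the recommended IPI approach (reusing the hypothetical
my_counter variable from the earlier sketch), smp_call_function_single()
can run the update on the cpu that owns the data::

  static void bump_remote(void *info)
  {
      /* this runs on the target cpu, so this_cpu ops are safe here */
      this_cpu_inc(my_counter);
  }

  static void bump_on(int cpu)
  {
      /* instead of writing to the other cpu's per cpu area directly */
      smp_call_function_single(cpu, bump_remote, NULL, 1);
  }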


@@ -0,0 +1,265 @@
=========================
Unaligned Memory Accesses
=========================
:Author: Daniel Drake <dsd@gentoo.org>
:Author: Johannes Berg <johannes@sipsolutions.net>

:With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
  Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz,
  Vadim Lobanov

Linux runs on a wide variety of architectures which have varying behaviour
when it comes to memory access. This document presents some details about
unaligned accesses, why you need to write code that doesn't cause them,
and how to write such code!

The definition of an unaligned access
=====================================

Unaligned memory accesses occur when you try to read N bytes of data starting
from an address that is not evenly divisible by N (i.e. addr % N != 0).
For example, reading 4 bytes of data from address 0x10004 is fine, but
reading 4 bytes of data from address 0x10005 would be an unaligned memory
access.

The above may seem a little vague, as memory access can happen in different
ways. The context here is at the machine code level: certain instructions read
or write a number of bytes to or from memory (e.g. movb, movw, movl in x86
assembly). As will become clear, it is relatively easy to spot C statements
which will compile to multiple-byte memory access instructions, namely when
dealing with types such as u16, u32 and u64.

Natural alignment
=================

The rule mentioned above forms what we refer to as natural alignment:
when accessing N bytes of memory, the base memory address must be evenly
divisible by N, i.e. addr % N == 0.

When writing code, assume the target architecture has natural alignment
requirements.

In reality, only a few architectures require natural alignment on all sizes
of memory access. However, we must consider ALL supported architectures;
writing code that satisfies natural alignment requirements is the easiest way
to achieve full portability.
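
As a quick illustration (a sketch, not from the original text), the
kernel's IS_ALIGNED() helper expresses the natural-alignment check
directly::

  #include <linux/kernel.h>

  static bool u32_aligned(const void *ptr)
  {
      /* true if ptr is suitably aligned for a 4-byte access */
      return IS_ALIGNED((unsigned long)ptr, sizeof(u32));
  }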

Why unaligned access is bad
===========================

The effects of performing an unaligned memory access vary from architecture
to architecture. It would be easy to write a whole document on the differences
here; a summary of the common scenarios is presented below:

- Some architectures are able to perform unaligned memory accesses
  transparently, but there is usually a significant performance cost.

- Some architectures raise processor exceptions when unaligned accesses
  happen. The exception handler is able to correct the unaligned access,
  at significant cost to performance.

- Some architectures raise processor exceptions when unaligned accesses
  happen, but the exceptions do not contain enough information for the
  unaligned access to be corrected.

- Some architectures are not capable of unaligned memory access, but will
  silently perform a different memory access to the one that was requested,
  resulting in a subtle code bug that is hard to detect!

It should be obvious from the above that if your code causes unaligned
memory accesses to happen, your code will not work correctly on certain
platforms and will cause performance problems on others.

Code that does not cause unaligned access
=========================================

At first, the concepts above may seem a little hard to relate to actual
coding practice. After all, you don't have a great deal of control over
memory addresses of certain variables, etc.

Fortunately things are not too complex, as in most cases, the compiler
ensures that things will work for you. For example, take the following
structure::

  struct foo {
      u16 field1;
      u32 field2;
      u8 field3;
  };

Let us assume that an instance of the above structure resides in memory
starting at address 0x10000. With a basic level of understanding, it would
not be unreasonable to expect that accessing field2 would cause an unaligned
access. You'd be expecting field2 to be located at offset 2 bytes into the
structure, i.e. address 0x10002, but that address is not evenly divisible
by 4 (remember, we're reading a 4 byte value here).

Fortunately, the compiler understands the alignment constraints, so in the
above case it would insert 2 bytes of padding in between field1 and field2.
Therefore, for standard structure types you can always rely on the compiler
to pad structures so that accesses to fields are suitably aligned (assuming
you do not cast the field to a type of different length).

Similarly, you can also rely on the compiler to align variables and function
parameters to a naturally aligned scheme, based on the size of the type of
the variable.

At this point, it should be clear that accessing a single byte (u8 or char)
will never cause an unaligned access, because all memory addresses are evenly
divisible by one.

On a related topic, with the above considerations in mind you may observe
that you could reorder the fields in the structure in order to place fields
where padding would otherwise be inserted, and hence reduce the overall
resident memory size of structure instances. The optimal layout of the
above example is::

  struct foo {
      u32 field2;
      u16 field1;
      u8 field3;
  };

For a natural alignment scheme, the compiler would only have to add a single
byte of padding at the end of the structure. This padding is added in order
to satisfy alignment constraints for arrays of these structures.

Another point worth mentioning is the use of __attribute__((packed)) on a
structure type. This GCC-specific attribute tells the compiler never to
insert any padding within structures, useful when you want to use a C struct
to represent some data that comes in a fixed arrangement 'off the wire'.

You might be inclined to believe that usage of this attribute can easily
lead to unaligned accesses when accessing fields that do not satisfy
architectural alignment requirements. However, again, the compiler is aware
of the alignment constraints and will generate extra instructions to perform
the memory access in a way that does not cause unaligned access. Of course,
the extra instructions obviously cause a loss in performance compared to the
non-packed case, so the packed attribute should only be used when avoiding
structure padding is of importance.
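
For illustration, a hypothetical wire-format header (not from the original
text) might be declared like this; the compiler will then emit safe, if
slower, code for accesses to the misaligned fields::

  struct wire_hdr {
      u8  type;
      u16 len;  /* at offset 1: would normally be padded */
      u32 seq;  /* at offset 3: would normally be padded */
  } __attribute__((packed));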

Code that causes unaligned access
=================================

With the above in mind, let's move onto a real life example of a function
that can cause an unaligned memory access. The following function taken
from include/linux/etherdevice.h is an optimized routine to compare two
ethernet MAC addresses for equality::

  bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
  {
  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
      u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
                 ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));

      return fold == 0;
  #else
      const u16 *a = (const u16 *)addr1;
      const u16 *b = (const u16 *)addr2;

      return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
  #endif
  }

In the above function, when the hardware has efficient unaligned access
capability, there is no issue with this code. But when the hardware isn't
able to access memory on arbitrary boundaries, the reference to a[0] causes
2 bytes (16 bits) to be read from memory starting at address addr1.

Think about what would happen if addr1 was an odd address such as 0x10003.
(Hint: it'd be an unaligned access.)

Despite the potential unaligned access problems with the above function, it
is included in the kernel anyway but is understood to only work normally on
16-bit-aligned addresses. It is up to the caller to ensure this alignment or
not use this function at all. This alignment-unsafe function is still useful
as it is a decent optimization for the cases when you can ensure alignment,
which is true almost all of the time in ethernet networking context.

Here is another example of some code that could cause unaligned accesses::

  void myfunc(u8 *data, u32 value)
  {
      [...]
      *((u32 *) data) = cpu_to_le32(value);
      [...]
  }

This code will cause unaligned accesses every time the data parameter points
to an address that is not evenly divisible by 4.

In summary, the 2 main scenarios where you may run into unaligned access
problems involve:

 1. Casting variables to types of different lengths
 2. Pointer arithmetic followed by access to at least 2 bytes of data

Avoiding unaligned accesses
===========================

The easiest way to avoid unaligned access is to use the get_unaligned() and
put_unaligned() macros provided by the <asm/unaligned.h> header file.

Going back to an earlier example of code that potentially causes unaligned
access::

  void myfunc(u8 *data, u32 value)
  {
      [...]
      *((u32 *) data) = cpu_to_le32(value);
      [...]
  }

To avoid the unaligned memory access, you would rewrite it as follows::

  void myfunc(u8 *data, u32 value)
  {
      [...]
      value = cpu_to_le32(value);
      put_unaligned(value, (u32 *) data);
      [...]
  }

The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
memory and you wish to avoid unaligned access, its usage is as follows::

  u32 value = get_unaligned((u32 *) data);

These macros work for memory accesses of any length (not just 32 bits as
in the examples above). Be aware that when compared to standard access of
aligned memory, using these macros to access unaligned memory can be costly in
terms of performance.

If use of such macros is not convenient, another option is to use memcpy(),
where the source or destination (or both) are of type u8* or unsigned char*.
Due to the byte-wise nature of this operation, unaligned accesses are avoided.
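
For example, the earlier hypothetical store could also be written with
memcpy() (a sketch, equivalent to the put_unaligned() version above)::

  void myfunc(u8 *data, u32 value)
  {
      value = cpu_to_le32(value);
      /* byte-wise copy, so no multi-byte access is generated */
      memcpy(data, &value, sizeof(value));
  }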

Alignment vs. Networking
========================

On architectures that require aligned loads, networking requires that the IP
header is aligned on a four-byte boundary to optimise the IP stack. For
regular ethernet hardware, the constant NET_IP_ALIGN is used. On most
architectures this constant has the value 2 because the normal ethernet
header is 14 bytes long, so in order to get proper alignment one needs to
DMA to an address which can be expressed as 4*n + 2. One notable exception
here is powerpc which defines NET_IP_ALIGN to 0 because DMA to unaligned
addresses can be very expensive and dwarf the cost of unaligned loads.

For some ethernet hardware that cannot DMA to unaligned addresses like
4*n+2 or non-ethernet hardware, this can be a problem, and it is then
required to copy the incoming frame into an aligned buffer. Because this is
unnecessary on architectures that can do unaligned accesses, the code can be
made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so::

  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
      skb = original skb
  #else
      skb = copy skb
  #endif
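
In receive-path terms, a sketch of the usual allocation pattern (the length
variable is hypothetical) looks like this::

  skb = netdev_alloc_skb(dev, pkt_len + NET_IP_ALIGN);
  if (!skb)
      return;
  /* shift the data pointer so that the IP header, 14 bytes into
   * the ethernet frame, lands on a 4-byte boundary */
  skb_reserve(skb, NET_IP_ALIGN);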