-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvWyDMACgkQnJ2qBz9k
QNnifgf+PXybPXX3KxtRUmK4u2zX2JMTwzuE0wmLxM6I08tf7rzLrBIbOY7iXka/
nzW6IK+KnA5HtPTEUbxqNBAvWpUAvPLZ/v20d0t/QTMJcz8yfhpvM9O2mjQAGMH8
EBmjjEhZaso8uOIAPhUg9um1QdQoYWa329fsoQuHor9kjKmDg+3RmtdH0jbRzQ6B
RNAY1WNFbm+7MH7Fu3AB/jLqqkwZhoPcu7TwXP6m+va6xAvzEYUOQQB9rPEIaY2Z
+q0B9LhwFIAnWPCI7dxw3CBTndoR2u1vkpnGw5FFhJgnMG4L1QMPoCCYPIZEIXg/
VuGZQ0/mayCtO+JWw+VDJF3jQFrHxA==
=J6tx
-----END PGP SIGNATURE-----
Merge tag 'for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Amir's patches to implement superblock fanotify watches, Xiaoming's
patch to enable reporting of thread IDs in fanotify events instead of
TGIDs (sadly the patch got mis-attributed to Amir and I've noticed
only now), and a fix of possible oops on umount caused by fsnotify
infrastructure"
* tag 'for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: Fix busy inodes during unmount
fs: group frequently accessed fields of struct super_block together
fanotify: support reporting thread id instead of process id
fanotify: add BUILD_BUG_ON() to count the bits of fanotify constants
fsnotify: convert runtime BUG_ON() to BUILD_BUG_ON()
fanotify: deprecate uapi FAN_ALL_* constants
fanotify: simplify handling of FAN_ONDIR
fsnotify: generalize handling of extra event flags
fanotify: fix collision of internal and uapi mark flags
fanotify: store fanotify_init() flags in group's fanotify_data
fanotify: add API to attach/detach super block mark
fsnotify: send path type events to group with super block marks
fsnotify: add super block object type
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:
CPU1 CPU2
fsnotify_put_mark()
spin_lock(&conn->lock);
...
inode = fsnotify_detach_connector_from_object(conn)
spin_unlock(&conn->lock);
generic_shutdown_super()
fsnotify_unmount_inodes()
sees connector detached for inode
-> nothing to do
evict_inode()
barfs on pending inode reference
iput(inode);
Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.
Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:
Avg Stddev
Vanilla 9.817400 0.276165
Fixed 9.710467 0.228294
So there's no statistically significant difference.
Fixes: 6b3f05d24d ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Change i_pages from a radix_tree_root to an xarray, convert the
documentation into kernel-doc format and change the order of the elements
to pack them better on 64-bit systems.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Kernel test robot reported [1] a 6% performance regression in a
concurrent unlink(2) workload on commit 60f7ed8c7c ("fsnotify: send
path type events to group with super block marks").
The performance test was run with no fsnotify marks at all on the
data set, so the only extra instructions added by the offending
commit are tests of the super_block fields s_fsnotify_{marks,mask}
and these tests happen on almost every single inode access.
When adding those fields to the super_block struct, we did not give much
thought of placing them on a hot cache lines (we just placed them at the
end of the struct).
Re-organize struct super_block to try and keep some frequently accessed
fields on the same cache line.
Move the frequently accessed fields s_fsnotify_{marks,mask} near the
frequently accessed fields s_fs_info,s_time_gran, while filling a 64bit
alignment hole after s_time_gran.
Move the seldom accessed fields s_id,s_uuid,s_max_links,s_mode near the
seldom accessed fields s_vfs_rename_mutex,s_subtype.
Rong Chen confirmed that this patch solved the reported problem.
[1] https://lkml.org/lkml/2018/9/30/206
Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 1e6cb72399 ("fsnotify: add super block object type")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.
The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.
It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.
Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.
This is going to be used by fanotify backend to setup super block
marks.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This is going to be used by overlayfs and possibly useful
for other filesystems.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such. Fix that up. Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.
Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead. Kill off that member.
Add some documentation notes on ->readpages() being purely for
read-ahead.
Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1. The flag was never set.
Let's discard it properly.
Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc vfs updates from Al Viro:
"Misc cleanups from various folks all over the place
I expected more fs/dcache.c cleanups this cycle, so that went into a
separate branch. Said cleanups have missed the window, so in the
hindsight it could've gone into work.misc instead. Decided not to
cherry-pick, thus the 'work.dcache' branch"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: dcache: Use true and false for boolean values
fold generic_readlink() into its only caller
fs: shave 8 bytes off of struct inode
fs: Add more kernel-doc to the produced documentation
fs: Fix attr.c kernel-doc
removed extra extern file_fdatawait_range
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill dentry_update_name_case()
Pull vfs icache updates from Al Viro:
- NFS mkdir/open_by_handle race fix
- analogous solution for FUSE, replacing the one currently in mainline
- new primitive to be used when discarding halfway set up inodes on
failed object creation; gives sane warranties re icache lookups not
returning such doomed by still not freed inodes. A bunch of
filesystems switched to that animal.
- Miklos' fix for last cycle regression in iget5_locked(); -stable will
need a slightly different variant, unfortunately.
- misc bits and pieces around things icache-related (in adfs and jfs).
* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jfs: don't bother with make_bad_inode() in ialloc()
adfs: don't put inodes into icache
new helper: inode_fake_hash()
vfs: don't evict uninitialized inode
jfs: switch to discard_new_inode()
ext2: make sure that partially set up inodes won't be returned by ext2_iget()
udf: switch to discard_new_inode()
ufs: switch to discard_new_inode()
btrfs: switch to discard_new_inode()
new primitive: discard_new_inode()
kill d_instantiate_no_diralias()
nfs_instantiate(): prevent multiple aliases for directory inode
Pull vfs open-related updates from Al Viro:
- "do we need fput() or put_filp()" rules are gone - it's always fput()
now. We keep track of that state where it belongs - in ->f_mode.
- int *opened mess killed - in finish_open(), in ->atomic_open()
instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().
- alloc_file() wrappers with saner calling conventions are introduced
(alloc_file_clone() and alloc_file_pseudo()); callers converted, with
much simplification.
- while we are at it, saner calling conventions for path_init() and
link_path_walk(), simplifying things inside fs/namei.c (both on
open-related paths and elsewhere).
* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
few more cleanups of link_path_walk() callers
allow link_path_walk() to take ERR_PTR()
make path_init() unconditionally paired with terminate_walk()
document alloc_file() changes
make alloc_file() static
do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
new helper: alloc_file_clone()
create_pipe_files(): switch the first allocation to alloc_file_pseudo()
anon_inode_getfile(): switch to alloc_file_pseudo()
hugetlb_file_setup(): switch to alloc_file_pseudo()
ocxlflash_getfile(): switch to alloc_file_pseudo()
cxl_getfile(): switch to alloc_file_pseudo()
... and switch shmem_file_setup() to alloc_file_pseudo()
__shmem_file_setup(): reorder allocations
new wrapper: alloc_file_pseudo()
kill FILE_{CREATED,OPENED}
switch atomic_open() and lookup_open() to returning 0 in all success cases
document ->atomic_open() changes
->atomic_open(): return 0 in all success cases
get rid of 'opened' in path_openat() and the helpers downstream
...
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction. However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.
Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations. That primitive unlocks new
inode, but leaves I_CREATING in place.
iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
insert_inode_locked() treats the same as instant -EBUSY.
ilookup() treats those as icache miss.
[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs fixes from Al Viro:
"Fix several places that screw up cleanups after failures halfway
through opening a file (one open-coding filp_clone_open() and getting
it wrong, two misusing alloc_file()). That part is -stable fodder from
the 'work.open' branch.
And Christoph's regression fix for uapi breakage in aio series;
include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
definition of sigset_t, the reason for doing so in the first place had
been bogus - there's no need to expose struct __aio_sigset in
aio_abi.h at all"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: don't expose __aio_sigset in uapi
ocxlflash_getfile(): fix double-iput() on alloc_file() failures
cxl_getfile(): fix double-iput() on alloc_file() failures
drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
Opening regular files on overlayfs is now handled via ovl_open(). Remove
the now unused "open_flags" argument from d_op->d_real() and the d_real()
helper.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This partially reverts commit c568d68341.
Overlayfs files will now automatically get the correct locks, no need to
hack overlay support in VFS.
It is a partial revert, because it leaves the locks_inode() calls in place
and defines locks_inode() to file_inode(). We could revert those as well,
but it would be unnecessary code churn and it makes sense to document that
we are getting the inode for locking purposes.
Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace
API for some time (though not in a useful way). Will try to remove
internal flags later when the dust around the new mount API settles.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Stacking file operations in overlay will store an extra open file for each
overlay file opened.
The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.
For fear of breaking working setups, don't start accounting the extra file.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
open a file by given inode, faking ->f_path. Use with shitloads
of caution - at the very least you'd damn better make sure that
some dentry alias of that inode is pinned down by the path in
question. Again, this is no general-purpose interface and I hope
it will eventually go away. Right now overlayfs wants something
like that, but nothing else should.
Any out-of-tree code with bright idea of using this one *will*
eventually get hurt, with zero notice and great delight on my part.
I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
as "you are welcome to use it".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...
Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
basically, "is that instance set up enough for regular fput(), or
do we want put_filp() for that one".
NOTE: the only alloc_file() caller that could be followed by put_filp()
is in arch/ia64/kernel/perfmon.c, which is (Kconfig-level) broken.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Failure of ->open() should *not* be followed by fput(). Fixed by
using filp_clone_open(), which gets the cleanups right.
Cc: stable@vger.kernel.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's exactly the same thing as
dentry_open(&file->f_path, file->f_flags, file->f_cred)
... and rename it to file_clone_open(), while we are at it.
'filp' naming convention is bogus; sure, it's "file pointer",
but we generally don't do that kind of Hungarian notation.
Some of the instances have too many callers to touch, but this
one has only two, so let's sanitize it while we can...
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here is a link to Linus' reply to Jan's concern about making
i_blkbibts byte addressable:
https://marc.info/?l=linux-fsdevel&m=152882624707975&w=2
Here is a link to an lkp.org report about potential performance
improvement in some workload, which could(?) be related to packing
i_blkbits closer to i_bytes/i_lock:
https://marc.info/?l=linux-fsdevel&m=153077048108198&w=2
Changes since v1:
- Add links to relevant discussions
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Clean up f_op->dedupe_file_range() interface.
1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.
Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.
But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.
[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff added this extern twice in commit a823e4589e
Fixes: a823e4589e ("mm: add file_fdatawait_range and file_write_and_wait")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
Pull aio iopriority support from Al Viro:
"The rest of aio stuff for this cycle - Adam's aio ioprio series"
* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: aio ioprio use ioprio_check_cap ret val
fs: aio ioprio add explicit block layer dependence
fs: iomap dio set bio prio from kiocb prio
fs: blkdev set bio prio from kiocb prio
fs: Add aio iopriority support
fs: Convert kiocb rw_hint from enum to u16
block: add ioprio_check_cap function
This contains a fix for the vfs_mkdir() issue discovered by Al, as well as
other fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCWxatHQAKCRDh3BK/laaZ
POg5AP95a/uUOrTJeTsENJwTmyAwHed9a6y4abKtvNErxUm4awD9FmhyYXodzJNq
9/mheT4kV2XkR/KkxI5sizfT1uPuvgA=
=+ljQ
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"This contains a fix for the vfs_mkdir() issue discovered by Al, as
well as other fixes and cleanups"
* tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: use inode_insert5() to hash a newly created inode
ovl: Pass argument to ovl_get_inode() in a structure
vfs: factor out inode_insert5()
ovl: clean up copy-up error paths
ovl: return EIO on internal error
ovl: make ovl_create_real() cope with vfs_mkdir() safely
ovl: create helper ovl_create_temp()
ovl: return dentry from ovl_create_real()
ovl: struct cattr cleanups
ovl: strip debug argument from ovl_do_ helpers
ovl: remove WARN_ON() real inode attributes mismatch
ovl: Kconfig documentation fixes
ovl: update documentation for unionmount-testsuite
algorithms. Yes, Speck is contrversial, but the intention is to use
them only for the lowest end Android devices, where the alternative
*really* is no encryption at all for data stored at rest.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlsW/RAACgkQ8vlZVpUN
gaO1pwf/WOusoXBK5sUuiC8d9I5s+OlPhTKhrh+BcL7/xhOkyh2xDv2FEwsjhwUf
qo26AMf7DsWKWgJ6wDQ1z+PIuPSNeQy5dCKbz2hbfNjET3vdk2NuvPWnIbFrmIek
LB6Ii9jKlPJRO4T3nMrE9JzJZLsoX5OKRSgYTT3EviuW/wSXaFyi7onFnyKXBnF/
e689tE50P42PgTEDKs4RDw43PwEGbcl5Vtj+Lnoh6VGX/pYvL/9ZbEYlKrgqSOU4
DmckR8D8UU/Gy6G5bvMsVuJpLEU7vBxupOOHI/nJFwR6tuYi0Q1j7C/zH8BvWp5e
o8P5GpOWk7Gm346FaUlkAZ+25bCU+A==
=EBeE
-----END PGP SIGNATURE-----
Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Add bunch of cleanups, and add support for the Speck128/256
algorithms.
Yes, Speck is contrversial, but the intention is to use them only for
the lowest end Android devices, where the alternative *really* is no
encryption at all for data stored at rest"
* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: log the crypto algorithm implementations
fscrypt: add Speck128/256 support
fscrypt: only derive the needed portion of the key
fscrypt: separate key lookup from key derivation
fscrypt: use a common logging function
fscrypt: remove internal key size constants
fscrypt: remove unnecessary check for non-logon key type
fscrypt: make fscrypt_operations.max_namelen an integer
fscrypt: drop empty name check from fname_decrypt()
fscrypt: drop max_namelen check from fname_decrypt()
fscrypt: don't special-case EOPNOTSUPP from fscrypt_get_encryption_info()
fscrypt: don't clear flags on crypto transform
fscrypt: remove stale comment from fscrypt_d_revalidate()
fscrypt: remove error messages for skcipher_request_alloc() failure
fscrypt: remove unnecessary NULL check when allocating skcipher
fscrypt: clean up after fscrypt_prepare_lookup() conversions
fs, fscrypt: only define ->s_cop when FS_ENCRYPTION is enabled
fscrypt: use unbound workqueue for decryption
file systems.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlsWmXcACgkQ8vlZVpUN
gaMGmAf+JGK4XysAvlJuj9tJfFPHHwgXSBBe/GAgyjhW9XhtNRHprUM2SpvwpIdI
Isl5O8Ec+FywBJB0I4AGy6yds6DE6jn38FFRFEhVmkp4EoROJiIr8+a7spfVuC3m
cWrHBgc7FwK4qYlyuGtH2+6NYva+KNFr+wwbvvUusvldyZAWMzflfrcdHM6D+/JE
+Sd5I7aniqnP5fICq0b4xrP2zWO4XJEKMbZO2dJ9yRtMmUnbaSj6G+bTGDRyfrNk
L3wJhqIu93o7zjDaEC0UfXSLAXzoDGWHeq7fBssaJiXj/hNtAvAGPaRMbgFR9a3h
uHmhvf84iyJuM+8GgG25UqeGwCuWiA==
=b0VQ
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of cleanups and bug fixes, especially dealing with corrupted
file systems"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: fix fencepost error in check for inode count overflow during resize
ext4: correctly handle a zero-length xattr with a non-zero e_value_offs
ext4: bubble errors from ext4_find_inline_data_nolock() up to ext4_iget()
ext4: do not allow external inodes for inline data
ext4: report delalloc reserve as non-free in statfs for project quota
ext4: remove NULL check before calling kmem_cache_destroy()
jbd2: remove NULL check before calling kmem_cache_destroy()
jbd2: remove bunch of empty lines with jbd2 debug
ext4: handle errors on ext4_commit_super
ext4: do not update s_last_mounted of a frozen fs
ext4: factor out helper ext4_sample_last_mounted()
vfs: add the sb_start_intwrite_trylock() helper
ext4: update mtime in ext4_punch_hole even if no blocks are released
ext4: add verifier check for symlink with append/immutable flags
fs: ext4: add new return type vm_fault_t
ext4: fix hole length detection in ext4_ind_map_blocks()
ext4: mark block bitmap corrupted when found
ext4: mark inode bitmap corrupted when found
ext4: add new ext4_mark_group_bitmap_corrupted() helper
ext4: fix wrong return value in ext4_read_inode_bitmap()
...
Pull aio updates from Al Viro:
"Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.
The only thing I'm holding back for a day or so is Adam's aio ioprio -
his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
but let it sit in -next for decency sake..."
* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
aio: sanitize the limit checking in io_submit(2)
aio: fold do_io_submit() into callers
aio: shift copyin of iocb into io_submit_one()
aio_read_events_ring(): make a bit more readable
aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
aio: take list removal to (some) callers of aio_complete()
aio: add missing break for the IOCB_CMD_FDSYNC case
random: convert to ->poll_mask
timerfd: convert to ->poll_mask
eventfd: switch to ->poll_mask
pipe: convert to ->poll_mask
crypto: af_alg: convert to ->poll_mask
net/rxrpc: convert to ->poll_mask
net/iucv: convert to ->poll_mask
net/phonet: convert to ->poll_mask
net/nfc: convert to ->poll_mask
net/caif: convert to ->poll_mask
net/bluetooth: convert to ->poll_mask
net/sctp: convert to ->poll_mask
net/tipc: convert to ->poll_mask
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbFTxwAAoJEAAOaEEZVoIVfOQQAIMmcJjx8c+86TS5bVpi9yWN
z6C2QEW1vMN2Z65dvjquSbF81yY54oCgiV3/6iGOtcOOI3rH/7f+CKb+4xRskpFv
PSVXKXkXKOGRnnhcz2X8F+FBbd4xLxzyB+Ff9HCWVv2L76d7Uu8wUcbSpOdC7GOd
pXHQkA+WNPEnHpr4To09ZwS14IzVCAhFii7MvOU6IXxzbP473/aPpFjrwa+mbkkz
J7U4bijaLL5eGJGE0ElHOaeF/iAhOBEFGrkT3lR78ZA+spnqVXlv2U9E29zg9H98
JyFM7vMe9MtR///Ve6BeLNKrDnwCmBo9ScMR8eWUWhrOInGX9yU0UR9XkD5TzA3U
hQHW2ckjloykp+HuvbUOF/Aut8GczbyaDsAHZ6/fAxgNGnFeJ7yjR5IPKFkKekPt
H+ls9j4C9yZpTLXioNqJz6d2IOS5MKfjPFC0c2ItO11jzNdbmjO7ciIq4EecSR51
FKeoUHyjhfak94Dh7x3IQEtnm1o5CbbSNvQ8g6PJYXFBi88jcmQkLmZlULDV3vyg
9febfSYpJhamJ+GUn8av+HziJG/1o1sOLrIVi2qGRIBJiUO5tAYY6UVvo1xKuzQ3
BMSRVnqVbs749lUV0QrYoFVdMvVxL6l0+8p4Xgx7q5BR4gG73e4yaBnXLeStMIYo
oSFy/EH3MYR87m5Yk45E
=rqcZ
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull fasync fix from Jeff Layton:
"Just a single fix for a deadlock in the fasync handling code that
Kirill observed while testing.
The fix is to change the fa_lock to be rwlock_t, and use a read lock
in kill_fasync_rcu"
* tag 'locks-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fasync: Fix deadlock between task-context and interrupt-context kill_fasync()
Pull misc vfs updates from Al Viro:
"Misc bits and pieces not fitting into anything more specific"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: delete unnecessary assignment in vfs_listxattr
Documentation: filesystems: update filesystem locking documentation
vfs: namei: use path_equal() in follow_dotdot()
fs.h: fix outdated comment about file flags
__inode_security_revalidate() never gets NULL opt_dentry
make xattr_getsecurity() static
vfat: simplify checks in vfat_lookup()
get rid of dead code in d_find_alias()
it's SB_BORN, not MS_BORN...
msdos_rmdir(): kill BS comment
remove rpc_rmdir()
fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
This is the per-I/O equivalent of the ioprio_set system call.
When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
This patch depends on block: add ioprio_check_cap function.
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>