mirror of
https://github.com/Fishwaldo/Star64_linux.git
synced 2025-04-23 14:54:03 +00:00
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
143 lines
4.4 KiB
C
143 lines
4.4 KiB
C
/*
|
|
* Copyright (c) 2006, 2007, 2008, 2009 QLogic Corporation. All rights reserved.
|
|
* Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
|
|
*
|
|
* This software is available to you under a choice of one of two
|
|
* licenses. You may choose to be licensed under the terms of the GNU
|
|
* General Public License (GPL) Version 2, available from the file
|
|
* COPYING in the main directory of this source tree, or the
|
|
* OpenIB.org BSD license below:
|
|
*
|
|
* Redistribution and use in source and binary forms, with or
|
|
* without modification, are permitted provided that the following
|
|
* conditions are met:
|
|
*
|
|
* - Redistributions of source code must retain the above
|
|
* copyright notice, this list of conditions and the following
|
|
* disclaimer.
|
|
*
|
|
* - Redistributions in binary form must reproduce the above
|
|
* copyright notice, this list of conditions and the following
|
|
* disclaimer in the documentation and/or other materials
|
|
* provided with the distribution.
|
|
*
|
|
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
|
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
|
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
|
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
|
|
* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
|
|
* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
|
|
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
* SOFTWARE.
|
|
*/
|
|
|
|
#include <linux/mm.h>
|
|
#include <linux/sched/signal.h>
|
|
#include <linux/device.h>
|
|
|
|
#include "qib.h"
|
|
|
|
static void __qib_release_user_pages(struct page **p, size_t num_pages,
|
|
int dirty)
|
|
{
|
|
size_t i;
|
|
|
|
for (i = 0; i < num_pages; i++) {
|
|
if (dirty)
|
|
set_page_dirty_lock(p[i]);
|
|
put_page(p[i]);
|
|
}
|
|
}
|
|
|
|
/**
|
|
* qib_map_page - a safety wrapper around pci_map_page()
|
|
*
|
|
* A dma_addr of all 0's is interpreted by the chip as "disabled".
|
|
* Unfortunately, it can also be a valid dma_addr returned on some
|
|
* architectures.
|
|
*
|
|
* The powerpc iommu assigns dma_addrs in ascending order, so we don't
|
|
* have to bother with retries or mapping a dummy page to insure we
|
|
* don't just get the same mapping again.
|
|
*
|
|
* I'm sure we won't be so lucky with other iommu's, so FIXME.
|
|
*/
|
|
int qib_map_page(struct pci_dev *hwdev, struct page *page, dma_addr_t *daddr)
|
|
{
|
|
dma_addr_t phys;
|
|
|
|
phys = pci_map_page(hwdev, page, 0, PAGE_SIZE, PCI_DMA_FROMDEVICE);
|
|
if (pci_dma_mapping_error(hwdev, phys))
|
|
return -ENOMEM;
|
|
|
|
if (!phys) {
|
|
pci_unmap_page(hwdev, phys, PAGE_SIZE, PCI_DMA_FROMDEVICE);
|
|
phys = pci_map_page(hwdev, page, 0, PAGE_SIZE,
|
|
PCI_DMA_FROMDEVICE);
|
|
if (pci_dma_mapping_error(hwdev, phys))
|
|
return -ENOMEM;
|
|
/*
|
|
* FIXME: If we get 0 again, we should keep this page,
|
|
* map another, then free the 0 page.
|
|
*/
|
|
}
|
|
*daddr = phys;
|
|
return 0;
|
|
}
|
|
|
|
/**
|
|
* qib_get_user_pages - lock user pages into memory
|
|
* @start_page: the start page
|
|
* @num_pages: the number of pages
|
|
* @p: the output page structures
|
|
*
|
|
* This function takes a given start page (page aligned user virtual
|
|
* address) and pins it and the following specified number of pages. For
|
|
* now, num_pages is always 1, but that will probably change at some point
|
|
* (because caller is doing expected sends on a single virtually contiguous
|
|
* buffer, so we can do all pages at once).
|
|
*/
|
|
int qib_get_user_pages(unsigned long start_page, size_t num_pages,
|
|
struct page **p)
|
|
{
|
|
unsigned long locked, lock_limit;
|
|
size_t got;
|
|
int ret;
|
|
|
|
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
|
|
locked = atomic64_add_return(num_pages, ¤t->mm->pinned_vm);
|
|
|
|
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
|
|
ret = -ENOMEM;
|
|
goto bail;
|
|
}
|
|
|
|
down_read(¤t->mm->mmap_sem);
|
|
for (got = 0; got < num_pages; got += ret) {
|
|
ret = get_user_pages(start_page + got * PAGE_SIZE,
|
|
num_pages - got,
|
|
FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
|
|
p + got, NULL);
|
|
if (ret < 0) {
|
|
up_read(¤t->mm->mmap_sem);
|
|
goto bail_release;
|
|
}
|
|
}
|
|
up_read(¤t->mm->mmap_sem);
|
|
|
|
return 0;
|
|
bail_release:
|
|
__qib_release_user_pages(p, got, 0);
|
|
bail:
|
|
atomic64_sub(num_pages, ¤t->mm->pinned_vm);
|
|
return ret;
|
|
}
|
|
|
|
void qib_release_user_pages(struct page **p, size_t num_pages)
|
|
{
|
|
__qib_release_user_pages(p, num_pages, 1);
|
|
|
|
/* during close after signal, mm can be NULL */
|
|
if (current->mm)
|
|
atomic64_sub(num_pages, ¤t->mm->pinned_vm);
|
|
}
|