PCIe link up fails with few legacy endpoints if root port advertises both
Gen-1 and Gen-2 speeds in Tegra. This is because link number negotiation
fails if both Gen1 & Gen2 are advertised. Tegra doesn't retry link up by
advertising only Gen1. Hence, the strategy followed here is to initially
advertise only Gen-1 and after link is up, retrain link to Gen-2 speed.
Tegra doesn't support HW autonomous speed change. Link comes up in Gen1
even if Gen2 is advertised, so there is no downside of this change.
This behavior is observed with following two PCIe devices on Tegra:
- Fusion HDTV 5 Express card
- IOGear SIL - PCIE - SATA card
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Recommended UpdateFC threshold in Tegra210 is 0x60 for best performance
of x1 link. Setting this to 0x60 provides the best balance between number
of UpdateFC packets and read data sent over the link.
UpdateFC timer frequency is equal to twice the value of register content
in nsec, i.e (2 * 0x60) = 192 nsec.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
The logic which blocks read requests till AFI gets ACK for all outstanding
writes from memory controller does not behave correctly when number of
outstanding writes become more than 32 in Tegra124 and Tegra132.
SW fixup is to prevent writes from accumulating more than 32 by:
- limiting outstanding posted writes to 14
- modifying Gen1 and Gen2 UpdateFC timer frequency
UpdateFC timer frequency is equal to twice the value of register content
in nsec. These settings are recommended after stress testing with
different values.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Sometimes link speed change from Gen2 to Gen1 fails due to instability
in deskew logic on lane-0 in Tegra210. Increase the deskew retry time
to resolve this issue.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Enable xclk clock clamping when entering L1. Clamp threshold will
determine the time spent waiting for clock module to turn on xclk after
signaling it. Default threshold value in Tegra124 and Tegra210 is not
enough to turn on xclk clock. Increase the clamp threshold to meet the
clock module timing in Tegra124 and Tegra210. Default threshold value is
enough in Tegra20, Tegra30 and Tegra186.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
PM message are truncated while entering L1 or L2, which is resulting in
receiver errors. Set the required bit to finish processing DLLP before
link enter L1 or L2.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Outstanding write counter in AFI is used to generate idle signal to
dynamically gate the AFI clock. When there are 32 outstanding writes
from AFI to memory, the outstanding write counter overflows and
indicates that there are "0" outstanding write transactions.
When memory controller is under heavy load, write completions to AFI
gets delayed and AFI write counter overflows. This causes AFI clock gating
even when there are outstanding transactions towards memory controller
resulting in a system hang.
Disable dynamic clock gating of AFI clock to avoid system hang.
CLKEN_OVERRIDE bit is not defined in Tegra20 and Tegra30, however
programming this bit doesn't cause any side effects. Program this
bit for all Tegra SoCs to avoid conditional check.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Enable opportunistic UpdateFC and ACK to allow data link layer send
pending ACKs and UpdateFC packets when link is idle instead of waiting
for timers to expire. This improves the PCIe performance due to better
utilization of PCIe bandwidth.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
UPHY electrical programming guidelines are documented in Tegra210 TRM.
Program these electrical settings for proper eye diagram in Gen1 and Gen2
link speeds.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Default root port setting hides AER capability. This patch enables the
advertisement of AER capability by root port.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Tegra124, Tegra132, Tegra210 and Tegra186 support Gen2 link speed. After
PCIe link is up in Gen1, set target link speed as Gen2 and retrain link.
Link switches to Gen2 speed if Gen2 capable end point is connected,
otherwise the link stays in Gen1.
Per PCIe 4.0r0.9 sec 7.6.3.7 implementation note, driver needs to wait for
PCIe LTSSM to come back from recovery before retraining the link.
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
The PCIe host power up sequence requires to program AFI(AXI to FPCI
bridge) registers first and then PCIe registers, otherwise AFI register
settings may not latch to PCIe IP.
PCIe root port starts LTSSM as soon as PCIe xrst is deasserted.
So deassert PCIe xrst after programming PCIe registers.
Modify PCIe power up sequence as follows:
- Power ungate PCIe partition
- Enable AFI clock
- Deassert AFI reset
- Program AFI registers
- Enable PCIe clock
- Deassert PCIe reset
- Program PCIe PHY
- Program PCIe pad control registers
- Program PCIe root port registers
- Deassert PCIe xrst to start LTSSM
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Tegra PCIe has register specifications for:
- AXI to FPCI(AFI) bridge
- Multiple PCIe root ports
- PCIe PHY
- PCIe pad control
Rearrange Tegra PCIe driver functions so that each function programs
the required module only.
- tegra_pcie_enable_controller(): Program AFI module and enable PCIe
controller
- tegra_pcie_phy_power_on(): Bring up PCIe PHY
- tegra_pcie_apply_pad_settings(): Program PCIe REFCLK pad settings
- tegra_pcie_enable_ports(): Program each root port and bring up PCIe
link
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Unroll the PCIe power on sequence if any one of the steps fails in
tegra_pcie_power_on().
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Presently, there is no path to DMA map P2PDMA memory, so if a TLP targeting
this memory hits the root complex and an IOMMU is present, the IOMMU will
reject the transaction, even if the RC would support P2PDMA.
So until the kernel knows to map these DMA addresses in the IOMMU, we
should not enable the whitelist when an IOMMU is present.
Link: https://lore.kernel.org/linux-pci/20190522201252.2997-1-logang@deltatee.com/
Fixes: 0f97da8310 ("PCI/P2PDMA: Allow P2P DMA between any devices under AMD ZEN Root Complex")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Prevent PCI bridges in general (and PCIe ports in particular)
from being put into low-power states during system-wide suspend
transitions if there are any devices in D0 below them and refine
the handling of PCI devices in D0 during suspend-to-idle cycles.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl0KAdUSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxejIQAIoR8FCLoKcxD4wJ6sDp5CtGVaw65pc9
i0WaTGlQWiBcr3bkxCRERl+NNjolVrUu7aAVrrUNe5SUQduFXZuGsreF0q3SPMUh
OZwSb+EpN6gSM3GTjrsF2P9nyvlJ80r5t9HI6vG1hAEFBU8T15gGVS6bnwm4ci7I
+KuIb4zwkOQ+LCwjqwkGjn6s4ZHmx2KxGnI58GBTAd4KsvV3G7QIaa7Lfa/js88C
pDhz8BiQqs/HTU0gHY52hsEvhKPeefMKH3QDpBFhoR0p1ZOkMoqK1jA+Kc5r/JF+
36Fj/rPlD26pmqYMZA7bZi4Ij0M+vR8SWCdefcvzPqUZpzkHh9y7/foi01DVNjsf
QGhlJODgGUl78mjEQwdPXz/ntzj4DEyo/3Re9Xf/SZ09sMeoyhbNi5Qolri05LqV
8hAshCNcFLbOzF1emcAa+Yq76tggWnW78q3oKAsfUqg4Olyvcbxy6J3GDRpzTwPz
D/4lEM7jtSqbcgprqWUcANB/zE3Jw93et0QtQNUfdOJ6a+LsS2XAhqenkQn2JQpk
7ZjaVfNNm3YDQlKt6nPaWCCxVv/g6KHSYXDeWB5VJpCOrSVhdXAZgPU+UCGrk9TU
3TqdqFoKi0LVZJVuWT+oyNfwzfolGZ7gd7TJVndFxVM8kbcLTzrj1ZQgpwP1l/tI
Xs12WM7cw1dy
=n3GG
-----END PGP SIGNATURE-----
Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Prevent PCI bridges in general (and PCIe ports in particular) from
being put into low-power states during system-wide suspend transitions
if there are any devices in D0 below them and refine the handling of
PCI devices in D0 during suspend-to-idle cycles"
* tag 'pm-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PCI: PM: Skip devices in D0 for suspend-to-idle
PME polling does not take into account that a device that is directly
connected to the host bridge may go into D3cold as well. This leads to a
situation where the PME poll thread reads from a config space of a
device that is in D3cold and gets incorrect information because the
config space is not accessible.
Here is an example from Intel Ice Lake system where two PCIe root ports
are in D3cold (I've instrumented the kernel to log the PMCSR register
contents):
[ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
[ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
Since 0xffff is interpreted so that PME is pending, the root ports will
be runtime resumed. This repeats over and over again essentially
blocking all runtime power management.
Prevent this from happening by checking whether the device is in D3cold
before its PME status is read.
Fixes: 71a83bd727 ("PCI/PM: add runtime PM support to PCIe port")
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: 3.6+ <stable@vger.kernel.org> # v3.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently Linux does not follow PCIe spec regarding the required delays
after reset. A concrete example is a Thunderbolt add-in-card that
consists of a PCIe switch and two PCIe endpoints:
+-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
+-01.0-[04-36]-- DS hotplug port
+-02.0-[37]----00.0 xHCI controller
\-04.0-[38-6b]-- DS hotplug port
The root port (1b.0) and the PCIe switch downstream ports are all PCIe
gen3 so they support 8GT/s link speeds.
We wait for the PCIe hierarchy to enter D3cold (runtime):
pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
PCIe switch is put to reset and its power is re-applied. This means that
we must follow the rules in PCIe 4.0 section 6.6.1.
For the PCIe gen3 ports we are dealing with here, the following applies:
With a Downstream Port that supports Link speeds greater than 5.0
GT/s, software must wait a minimum of 100 ms after Link training
completes before sending a Configuration Request to the device
immediately below that Port. Software can determine when Link training
completes by polling the Data Link Layer Link Active bit or by setting
up an associated interrupt (see Section 6.7.3.3).
Translating this into the above topology we would need to do this (DLLLA
stands for Data Link Layer Link Active):
pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0
I've instrumented the kernel with additional logging so we can see the
actual delays the kernel performs:
pcieport 0000:00:1b.0: power state changed by ACPI to D0
pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
pcieport 0000:00:1b.0: waking up bus
pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
...
pcieport 0000:00:1b.0: PME# disabled
pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
...
pcieport 0000:01:00.0: PME# disabled
pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
...
pcieport 0000:02:00.0: PME# disabled
pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
...
pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
pcieport 0000:02:01.0: PME# disabled
pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
...
pcieport 0000:02:02.0: PME# disabled
pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
...
pcieport 0000:02:04.0: PME# disabled
pcieport 0000:02:01.0: PME# enabled
pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
pcieport 0000:02:04.0: PME# enabled
pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
...
thunderbolt 0000:03:00.0: PME# disabled
xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
...
xhci_hcd 0000:37:00.0: PME# disabled
For the switch upstream port (01:00.0) we wait for 100ms but not taking
into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
transition of the root port and the two downstream hotplug ports. This
means that we deviate from what the spec requires.
Performing the same check for system sleep (s2idle) transitions we can
see following when resuming from s2idle:
pcieport 0000:00:1b.0: power state changed by ACPI to D0
pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
...
pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
...
pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
...
thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
This is even worse. None of the mandatory delays are performed. If this
would be S3 instead of s2idle then according to PCI FW spec 3.2 section
4.6.8. there is a specific _DSM that allows the OS to skip the delays
but this platform does not provide the _DSM and does not go to S3 anyway
so no firmware is involved that could already handle these delays.
In this particular Intel Coffee Lake platform these delays are not
actually needed because there is an additional delay as part of the ACPI
power resource that is used to turn on power to the hierarchy but since
that additional delay is not required by any of standards (PCIe, ACPI)
it is not present in the Intel Ice Lake, for example where missing the
mandatory delays causes pciehp to start tearing down the stack too early
(links are not yet trained).
For this reason, change the PCIe portdrv PM resume hooks so that they
perform the mandatory delays before the downstream component gets
resumed. We perform the delays before port services are resumed because
otherwise pciehp might find that the link is not up (even if it is just
training) and tears-down the hierarchy.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Stratix 10 PCIe controller does not support Type 1 to Type 0 conversion
as previous version (V1) does so the PCIe controller configuration
mechanism needs to send Type 0 config TLP if the target bus number
matches with the secondary bus number.
Implement a function to form a TLP header that depends on the PCIe
controller version, so that the header can be formed according to
specific host controller HW internals, fixing the type conversion issue.
Signed-off-by: Ley Foon Tan <ley.foon.tan@intel.com>
[lorenzo.pieralisi@arm.com: commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Bring PHY support for the Armada8k driver.
The Armada8k IP only supports x1, x2 or x4 link widths. Iterate over
the DT 'phys' entries and configure them one by one. Use
phy_set_mode_ext() to make use of the submode parameter (initially
introduced for Ethernet modes). For PCI configuration, let the submode
be the width (1, 2, 4, etc) so that the PHY driver knows how many
lanes are bundled. Do not error out in case of error for compatibility
reasons.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
The code in pci_dev_keep_suspended() is relatively hard to follow due
to the negative checks in it and in its callers and the function has
a possible side-effect (disabling the PME) which doesn't really match
its role.
For this reason, move the PME disabling from pci_dev_keep_suspended()
to a separate function and change the semantics (and name) of the
rest of it, so that 'true' is returned when the device needs to be
resumed (and not the other way around). Change the callers of
pci_dev_keep_suspended() accordingly.
While at it, make the code flow in pci_pm_poweroff() reflect the
pci_pm_suspend() more closely to avoid arbitrary differences between
them.
This is a cosmetic change with no intention to alter behavior.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
The current code resumes devices in D3hot during system suspend if
the target power state for them is D3cold, but that is not necessary
in general. It only is necessary to do that if the platform firmware
requires the device to be resumed, but that should be covered by
the platform_pci_need_resume() check anyway, so rework
pci_dev_keep_suspended() to avoid returning 'false' for devices
in D3hot which need not be resumed due to platform firmware
requirements.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Logan noticed that devm_memremap_pages_release() kills the percpu_ref
drops all the page references that were acquired at init and then
immediately proceeds to unplug, arch_remove_memory(), the backing pages
for the pagemap. If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.
As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.
Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback. Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 41e94a8513 ("add devm_memremap_pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for fixing a race between devm_memremap_pages_release()
and the final put of a page from the device-page-map, allocate a
percpu-ref per p2pdma resource mapping.
Link: http://lkml.kernel.org/r/155727338646.292046.9922678317501435597.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pci_p2pdma_add_resource() implementation immediately frees the pgmap
if gen_pool_add_virt() fails. However, that means that when @dev
triggers a devres release devm_memremap_pages_release() will crash
trying to access the freed @pgmap.
Use the new devm_memunmap_pages() to manually free the mapping in the
error path.
Link: http://lkml.kernel.org/r/155727337603.292046.13101332703665246702.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 52916982af ("PCI/P2PDMA: Support peer-to-peer memory")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit d491f2b752 ("PCI: PM: Avoid possible suspend-to-idle issue")
attempted to avoid a problem with devices whose drivers want them to
stay in D0 over suspend-to-idle and resume, but it did not go as far
as it should with that.
Namely, first of all, the power state of a PCI bridge with a
downstream device in D0 must be D0 (based on the PCI PM spec r1.2,
sec 6, table 6-1, if the bridge is not in D0, there can be no PCI
transactions on its secondary bus), but that is not actively enforced
during system-wide PM transitions, so use the skip_bus_pm flag
introduced by commit d491f2b752 for that.
Second, the configuration of devices left in D0 (whatever the reason)
during suspend-to-idle need not be changed and attempting to put them
into D0 again by force is pointless, so explicitly avoid doing that.
Fixes: d491f2b752 ("PCI: PM: Avoid possible suspend-to-idle issue")
Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
PCIe r5.0, sec 7.5.3.18, defines a new 32.0 GT/s bit in the Supported Link
Speeds Vector of Link Capabilities 2. Decode this new speed. This does
not affect the speed of the link, which should be negotiated automatically
by the hardware; it only adds decoding when showing the speed to the user.
Previously, reading the speed of a link operating at this speed showed
"Unknown speed" instead of "32.0 GT/s".
Link: https://lore.kernel.org/lkml/92365e3caf0fc559f9ab14bcd053bfc92d4f661c.1559664969.git.gustavo.pimentel@synopsys.com
Signed-off-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Commit 0e7df22401 ("PCI: Add sysfs sriov_drivers_autoprobe to control
VF driver binding") introduced the sriov_drivers_autoprobe attribute
which allows users to prevent the kernel from automatically probing a
driver for new VFs as they are created. This allows VFs to be spawned
without automatically binding the new device to a host driver, such as
in cases where the user intends to use the device only with a meta
driver like vfio-pci. However, the current implementation prevents any
use of drivers_probe with the VF while sriov_drivers_autoprobe=0. This
blocks the now current general practice of setting driver_override
followed by using drivers_probe to bind a device to a specified driver.
The kernel never automatically sets a driver_override therefore it seems
we can assume a driver_override reflects the intent of the user. Also,
probing a device using a driver_override match seems outside the scope
of the 'auto' part of sriov_drivers_autoprobe. Therefore, let's allow
driver_override matches regardless of sriov_drivers_autoprobe, which we
can do by simply testing if a driver_override is set for a device as a
'can probe' condition.
Fixes: 0e7df22401 ("PCI: Add sysfs sriov_drivers_autoprobe to control VF driver binding")
Link: https://lore.kernel.org/lkml/155742996741.21878.569845487290798703.stgit@gimli.home
Link: https://lore.kernel.org/linux-pci/155672991496.20698.4279330795743262888.stgit@gimli.home/T/#u
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The NVIDIA Turing GPU is a multi-function PCI device with the following
functions:
- Function 0: VGA display controller
- Function 1: Audio controller
- Function 2: USB xHCI Host controller
- Function 3: USB Type-C UCSI controller
Function 0 is tightly coupled with other functions in the hardware. When
function 0 is in D3, it gates power for hardware blocks used by other
functions, which means those functions only work when function 0 is in D0.
If any of these functions (1/2/3) are in D0, then function 0 should also be
in D0.
Commit 07f4f97d7b ("vga_switcheroo: Use device link for HDA controller")
already creates a device link to show the dependency of function 1 on
function 0 of this GPU. Create additional device links to express the
dependencies of functions 2 and 3 on function 0. This means function 0
will be in D0 if any other function is in D0.
[bhelgaas: I think the PCI spec expectation is that functions can be
power-managed independently, so I don't think this device is technically
compliant. For example, the PCIe r5.0 spec, sec 1.4, says "the PCI/PCIe
hardware/software model includes architectural constructs necessary to
discover, configure, and use a Function, without needing Function-specific
knowledge" and sec 5.1 says "D states are associated with a particular
Function" and "PM provides ... a mechanism to identify power management
capabilities of a given Function [and] the ability to transition a Function
into a certain power management state."]
Link: https://lore.kernel.org/lkml/20190606092225.17960-3-abhsahu@nvidia.com
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Although not allowed by the PCI specs, some multi-function devices have
power dependencies between the functions. For example, function 1 may not
work unless function 0 is in the D0 power state.
The existing quirk_gpu_hda() adds a device link to express this dependency
for GPU and HDA devices, but it really is not specific to those device
types.
Generalize it and rename it to pci_create_device_link() so we can create
dependencies between any "consumer" and "producer" functions of a
multi-function device, where the consumer is only functional if the
producer is in D0. This reorganization should not affect any
functionality.
Link: https://lore.kernel.org/lkml/20190606092225.17960-2-abhsahu@nvidia.com
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
[bhelgaas: commit log, reword diagnostic]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Seeing the we want to use more interrupts in the NTB MSI code
we need to be able allocate more (sometimes virtual) interrupts
in the switchtec driver. Therefore add a module parameter to
request to allocate additional interrupts.
This puts virtually no limit on the number of MSI interrupts available
to NTB clients.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
For NTB devices, we want to be able to trigger MSI interrupts
through a memory window. In these cases we may want to use
more interrupts than the NTB PCI device has available in its MSI-X
table.
We allow for this by creating a new 'virtual' interrupt. These
interrupts are allocated as usual but are not programmed into the
MSI-X table (as there may not be space for them).
The MSI address and data will then handled through an NTB MSI library
introduced later in this series.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Associated pci_epf_bar structure is needed in pci_epc_clear_bar() to
clear a BAR correctly but it is reset in pci_epf_free_space() (that
is called first) which results in pci_epc_clear_bar() failure.
Reorder the pci_epc_clear_bar()/pci_epf_free_space() calls execution
to fix the issue.
Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com>
[lorenzo.pieralisi@arm.com: reworded the commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
Always skip odd BAR when skipping 64bit BARs in pci_epf_test_set_bar()
and pci_epf_test_alloc_space() otherwise pci_epf_test_set_bar() will
call pci_epc_set_bar() on an odd loop index when skipping reserved 64bit
BAR.
Moreover, pci_epf_test_alloc_space() will call pci_epf_alloc_space() on
bind for an odd loop index when BAR is 64bit but leaks on subsequent
unbind by not calling pci_epf_free_space().
Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
Reviewed-by: Paul Walmsley <paul.walmsley@sifive.com>
PCI endpoint test function code should honor the .bar_fixed_size parameter
from underlying endpoint controller drivers or results may be unexpected.
In pci_epf_test_alloc_space(), check if BAR being used for test
register space is a fixed size BAR. If so, allocate the required fixed
size.
Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
Set endpoint controller pointer to NULL in pci_epc_remove_epf()
to avoid -EBUSY on subsequent call to pci_epc_add_epf().
Add a check for NULL endpoint function pointer.
Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
For PCI devices that have an OF node, set the fwnode as well. This way
drivers that rely on fwnode don't need the special case described by
commit f94277af03 ("of/platform: Initialise dev->fwnode appropriately").
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently, there is only a 1 ms sleep after asserting PERST.
Reading the datasheets for different endpoints, some require PERST to be
asserted for 10 ms in order for the endpoint to perform a reset, others
require it to be asserted for 50 ms.
Several SoCs using this driver uses PCIe Mini Card, where we don't know
what endpoint will be plugged in.
The PCI Express Card Electromechanical Specification r2.0, section
2.2, "PERST# Signal" specifies:
"On power up, the deassertion of PERST# is delayed 100 ms (TPVPERL) from
the power rails achieving specified operating limits."
Add a sleep of 100 ms before deasserting PERST, in order to ensure that
we are compliant with the spec.
Fixes: 82a823833f ("PCI: qcom: Add Qualcomm PCIe controller driver")
Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
Cc: stable@vger.kernel.org # 4.5+
Altera MSI IP is a soft IP and is only available after
an FPGA image (with design containing it) is programmed.
Make driver modulable to support use case FPGA image is programmed the
after kernel has booted, so that the driver can be loaded upon request.
Signed-off-by: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Altera PCIe Rootport IP is a soft IP and is only available after
an FPGA image (whose design contains it) is programmed.
Make driver modulable to support use cases when FPGA image is
programmed after the kernel has booted, so that the driver
can be loaded upon request.
Signed-off-by: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Commit 0e7df22401 ("PCI: Add sysfs sriov_drivers_autoprobe to control
VF driver binding") allows the user to specify that drivers for VFs of
a PF should not be probed, but it actually causes pci_device_probe() to
return success back to the driver core in this case. Therefore by all
sysfs appearances the device is bound to a driver, the driver link from
the device exists as does the device link back from the driver, yet the
driver's probe function is never called on the device. We also fail to
do any sort of cleanup when we're prohibited from probing the device,
the IRQ setup remains in place and we even hold a device reference.
Instead, abort with errno before any setup or references are taken when
pci_device_can_probe() prevents us from trying to probe the device.
Link: https://lore.kernel.org/lkml/155672991496.20698.4279330795743262888.stgit@gimli.home
Fixes: 0e7df22401 ("PCI: Add sysfs sriov_drivers_autoprobe to control VF driver binding")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The QCS404 platform contains a PCIe version 2.4.0 controller and a
Qualcomm PCIe2 PHY. The driver already supports version 2.4.0, for the
IPQ4019, but this support touches clocks and resets related to the PHY
as well and there's no upstream driver for the PHY.
On QCS404 we must initialize the PHY, so a separate PHY driver is
implemented to take care of this and the controller driver is updated to
not require the PHY related resources. This is done by relying on the
fact that operations in both the clock and reset framework are NOPs when
passed NULL, so we can isolate this change to only the
qcom_pcie_get_resources_2_4_0() function.
For QCS404 we also need to enable the AHB (iface) clock, in order to
access the register space of the controller, but as this is not part of
the IPQ4019 DT binding this is only added for new users of the 2.4.0
controller.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Niklas Cassel <niklas.cassel@linaro.org>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
Before introducing the QCS404 platform, which uses the same PCIe
controller as IPQ4019, migrate this to use the bulk clock API, in order
to make the error paths slighly cleaner.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Niklas Cassel <niklas.cassel@linaro.org>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
If a PCI driver leaves the device handled by it in D0 and calls
pci_save_state() on the device in its ->suspend() or ->suspend_late()
callback, it can expect the device to stay in D0 over the whole
s2idle cycle. However, that may not be the case if there is a
spurious wakeup while the system is suspended, because in that case
pci_pm_suspend_noirq() will run again after pci_pm_resume_noirq()
which calls pci_restore_state(), via pci_pm_default_resume_early(),
so state_saved is cleared and the second iteration of
pci_pm_suspend_noirq() will invoke pci_prepare_to_sleep() which
may change the power state of the device.
To avoid that, add a new internal flag, skip_bus_pm, that will be set
by pci_pm_suspend_noirq() when it runs for the first time during the
given system suspend-resume cycle if the state of the device has
been saved already and the device is still in D0. Setting that flag
will cause the next iterations of pci_pm_suspend_noirq() to set
state_saved for pci_pm_resume_noirq(), so that it always restores the
device state from the originally saved data, and avoid calling
pci_prepare_to_sleep() for the device.
Fixes: 33e4f80ee6 ("ACPI / PM: Ignore spurious SCI wakeups from suspend-to-idle")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Both acpi_pci_need_resume() and acpi_dev_needs_resume() check if the
current ACPI wakeup configuration of the device matches what is
expected as far as system wakeup from sleep states is concerned, as
reflected by the device_may_wakeup() return value for the device.
However, they only should do that if wakeup.flags.valid is set for
the device's ACPI companion, because otherwise the wakeup.prepare_count
value for it is meaningless.
Add the missing wakeup.flags.valid checks to these functions.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
* pci/printk:
PCI: Replace dev_printk(KERN_DEBUG) with dev_info(), etc
PCI: Replace printk(KERN_INFO) with pr_info(), etc
PCI: Use dev_printk() when possible