This is the start of the stable review cycle for the 5.4.142 release. There are 62 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Wed, 18 Aug 2021 12:54:12 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.142-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.4.142-rc1
Saeed Mirzamohammadi saeed.mirzamohammadi@oracle.com iommu/vt-d: Fix agaw for a supported 48 bit guest address width
Nathan Chancellor nathan@kernel.org vmlinux.lds.h: Handle clang's module.{c,d}tor sections
Jeff Layton jlayton@kernel.org ceph: take snap_empty_lock atomically with snaprealm refcount change
Jeff Layton jlayton@kernel.org ceph: clean up locking annotation for ceph_get_snap_realm and __lookup_snap_realm
Jeff Layton jlayton@kernel.org ceph: add some lockdep assertions around snaprealm handling
Sean Christopherson seanjc@google.com KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation
Thomas Gleixner tglx@linutronix.de PCI/MSI: Protect msi_desc::masked for multi-MSI
Thomas Gleixner tglx@linutronix.de PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
Thomas Gleixner tglx@linutronix.de PCI/MSI: Correct misleading comments
Thomas Gleixner tglx@linutronix.de PCI/MSI: Do not set invalid bits in MSI mask
Thomas Gleixner tglx@linutronix.de PCI/MSI: Enforce MSI[X] entry updates to be visible
Thomas Gleixner tglx@linutronix.de PCI/MSI: Enforce that MSI-X table entry is masked for update
Thomas Gleixner tglx@linutronix.de PCI/MSI: Mask all unused MSI-X entries
Thomas Gleixner tglx@linutronix.de PCI/MSI: Enable and mask MSI-X early
Ben Dai ben.dai@unisoc.com genirq/timings: Prevent potential array overflow in __irq_timings_store()
Bixuan Cui cuibixuan@huawei.com genirq/msi: Ensure deactivation on teardown
Babu Moger Babu.Moger@amd.com x86/resctrl: Fix default monitoring groups reporting
Thomas Gleixner tglx@linutronix.de x86/ioapic: Force affinity setup before startup
Thomas Gleixner tglx@linutronix.de x86/msi: Force affinity setup before startup
Thomas Gleixner tglx@linutronix.de genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
Randy Dunlap rdunlap@infradead.org x86/tools: Fix objdump version check again
Pu Lehui pulehui@huawei.com powerpc/kprobes: Fix kprobe Oops happens in booke
Xie Yongji xieyongji@bytedance.com nbd: Aovid double completion of a request
Longpeng(Mike) longpeng2@huawei.com vsock/virtio: avoid potential deadlock when vsock device remove
Maximilian Heyne mheyne@amazon.de xen/events: Fix race in set_evtchn_to_irq
Eric Dumazet edumazet@google.com net: igmp: increase size of mr_ifc_count
Neal Cardwell ncardwell@google.com tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
Willy Tarreau w@1wt.eu net: linkwatch: fix failure to restore device state across suspend/resume
Yang Yingliang yangyingliang@huawei.com net: bridge: fix memleak in br_add_if()
Vladimir Oltean vladimir.oltean@nxp.com net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
Vladimir Oltean vladimir.oltean@nxp.com net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
Vladimir Oltean vladimir.oltean@nxp.com net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
Eric Dumazet edumazet@google.com net: igmp: fix data-race in igmp_ifc_timer_expire()
Takeshi Misawa jeliantsurux@gmail.com net: Fix memory leak in ieee802154_raw_deliver
Ben Hutchings ben.hutchings@mind.be net: dsa: microchip: Fix ksz_read64()
Christian Hewitt christianshewitt@gmail.com drm/meson: fix colour distortion from HDR set during vendor u-boot
Aya Levin ayal@nvidia.com net/mlx5: Fix return value from tracer initialization
Roi Dayan roid@nvidia.com psample: Add a fwd declaration for skbuff
Md Fahad Iqbal Polash md.fahad.iqbal.polash@intel.com iavf: Set RSS LUT and key in reset handle path
Hangbin Liu liuhangbin@gmail.com net: sched: act_mirred: Reset ct info when mirror/redirect skb
Pali Rohár pali@kernel.org ppp: Fix generating ifname when empty IFLA_IFNAME is specified
Ben Hutchings ben.hutchings@mind.be net: phy: micrel: Fix link detection on ksz87xx switch"
Hans de Goede hdegoede@redhat.com platform/x86: pcengines-apuv2: Add missing terminating entries to gpio-lookup tables
Florian Eckert fe@dev.tdt.de platform/x86: pcengines-apuv2: revert wiring up simswitch GPIO as LED
DENG Qingfang dqfext@gmail.com net: dsa: mt7530: add the missing RxUnicast MIB counter
Richard Fitzgerald rf@opensource.cirrus.com ASoC: cs42l42: Fix LRCLK frame start edge
Yajun Deng yajun.deng@linux.dev netfilter: nf_conntrack_bridge: Fix memory leak when error
Richard Fitzgerald rf@opensource.cirrus.com ASoC: cs42l42: Remove duplicate control for WNF filter frequency
Richard Fitzgerald rf@opensource.cirrus.com ASoC: cs42l42: Fix inversion of ADC Notch Switch control
Richard Fitzgerald rf@opensource.cirrus.com ASoC: cs42l42: Don't allow SND_SOC_DAIFMT_LEFT_J
Richard Fitzgerald rf@opensource.cirrus.com ASoC: cs42l42: Correct definition of ADC Volume control
Dongliang Mu mudongliangabcd@gmail.com ieee802154: hwsim: fix GPF in hwsim_new_edge_nl
Dongliang Mu mudongliangabcd@gmail.com ieee802154: hwsim: fix GPF in hwsim_set_edge_lqi
Dan Williams dan.j.williams@intel.com libnvdimm/region: Fix label activation vs errors
Dan Williams dan.j.williams@intel.com ACPI: NFIT: Fix support for virtual SPA ranges
Luis Henriques lhenriques@suse.de ceph: reduce contention in ceph_check_delayed_caps()
Greg Kroah-Hartman gregkh@linuxfoundation.org i2c: dev: zero out array used for i2c reads from userspace
Takashi Iwai tiwai@suse.de ASoC: intel: atom: Fix reference to PCM buffer address
Takashi Iwai tiwai@suse.de ASoC: xilinx: Fix reference to PCM buffer address
Colin Ian King colin.king@canonical.com iio: adc: Fix incorrect exit of for-loop
Chris Lesiak chris.lesiak@licor.com iio: humidity: hdc100x: Add margin to the conversion time
Uwe Kleine-König u.kleine-koenig@pengutronix.de iio: adc: ti-ads7950: Ensure CS is deasserted after reading channels
-------------
Diffstat:
Makefile | 4 +- arch/powerpc/kernel/kprobes.c | 3 +- arch/x86/kernel/apic/io_apic.c | 6 +- arch/x86/kernel/apic/msi.c | 13 ++- arch/x86/kernel/cpu/resctrl/monitor.c | 27 +++-- arch/x86/kvm/vmx/vmx.h | 2 +- arch/x86/tools/chkobjdump.awk | 1 + drivers/acpi/nfit/core.c | 3 + drivers/base/core.c | 1 + drivers/block/nbd.c | 14 ++- drivers/gpu/drm/meson/meson_registers.h | 5 + drivers/gpu/drm/meson/meson_viu.c | 7 +- drivers/i2c/i2c-dev.c | 5 +- drivers/iio/adc/palmas_gpadc.c | 4 +- drivers/iio/adc/ti-ads7950.c | 1 - drivers/iio/humidity/hdc100x.c | 6 +- drivers/iommu/intel-iommu.c | 7 +- drivers/net/dsa/lan9303-core.c | 34 +++--- drivers/net/dsa/lantiq_gswip.c | 14 ++- drivers/net/dsa/microchip/ksz_common.h | 8 +- drivers/net/dsa/mt7530.c | 1 + drivers/net/dsa/sja1105/sja1105_main.c | 4 +- drivers/net/ethernet/intel/iavf/iavf_main.c | 13 ++- .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c | 11 +- drivers/net/ieee802154/mac802154_hwsim.c | 6 +- drivers/net/phy/micrel.c | 2 - drivers/net/ppp/ppp_generic.c | 2 +- drivers/nvdimm/namespace_devs.c | 17 ++- drivers/pci/msi.c | 125 +++++++++++++-------- drivers/platform/x86/pcengines-apuv2.c | 5 +- drivers/xen/events/events_base.c | 20 +++- fs/ceph/caps.c | 17 ++- fs/ceph/mds_client.c | 25 +++-- fs/ceph/snap.c | 54 +++++---- fs/ceph/super.h | 2 +- include/asm-generic/vmlinux.lds.h | 1 + include/linux/device.h | 1 + include/linux/inetdevice.h | 2 +- include/linux/irq.h | 2 + include/linux/msi.h | 2 +- include/net/psample.h | 2 + kernel/irq/chip.c | 5 +- kernel/irq/msi.c | 13 ++- kernel/irq/timings.c | 5 + net/bridge/br_if.c | 2 + net/bridge/netfilter/nf_conntrack_bridge.c | 6 + net/core/link_watch.c | 5 +- net/ieee802154/socket.c | 7 +- net/ipv4/igmp.c | 21 ++-- net/ipv4/tcp_bbr.c | 2 +- net/sched/act_mirred.c | 3 + net/vmw_vsock/virtio_transport.c | 7 +- sound/soc/codecs/cs42l42.c | 39 +++---- sound/soc/intel/atom/sst-mfld-platform-pcm.c | 3 +- sound/soc/xilinx/xlnx_formatter_pcm.c | 4 +- 55 files changed, 382 insertions(+), 219 deletions(-)
From: Uwe Kleine-König u.kleine-koenig@pengutronix.de
commit 9898cb24e454602beb6e17bacf9f97b26c85c955 upstream.
The ADS7950 requires that CS is deasserted after each SPI word. Before commit e2540da86ef8 ("iio: adc: ti-ads7950: use SPI_CS_WORD to reduce CPU usage") the driver used a message with one spi transfer per channel where each but the last one had .cs_change set to enforce a CS toggle. This was wrongly translated into a message with a single transfer and .cs_change set which results in a CS toggle after each word but the last which corrupts the first adc conversion of all readouts after the first readout.
Fixes: e2540da86ef8 ("iio: adc: ti-ads7950: use SPI_CS_WORD to reduce CPU usage") Signed-off-by: Uwe Kleine-König u.kleine-koenig@pengutronix.de Reviewed-by: David Lechner david@lechnology.com Tested-by: David Lechner david@lechnology.com Cc: Stable@vger.kernel.org Link: https://lore.kernel.org/r/20210709101110.1814294-1-u.kleine-koenig@pengutron... Signed-off-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/iio/adc/ti-ads7950.c | 1 - 1 file changed, 1 deletion(-)
--- a/drivers/iio/adc/ti-ads7950.c +++ b/drivers/iio/adc/ti-ads7950.c @@ -569,7 +569,6 @@ static int ti_ads7950_probe(struct spi_d st->ring_xfer.tx_buf = &st->tx_buf[0]; st->ring_xfer.rx_buf = &st->rx_buf[0]; /* len will be set later */ - st->ring_xfer.cs_change = true;
spi_message_add_tail(&st->ring_xfer, &st->ring_msg);
From: Chris Lesiak chris.lesiak@licor.com
commit 84edec86f449adea9ee0b4912a79ab8d9d65abb7 upstream.
The datasheets have the following note for the conversion time specification: "This parameter is specified by design and/or characterization and it is not tested in production."
Parts have been seen that require more time to do 14-bit conversions for the relative humidity channel. The result is ENXIO due to the address phase of a transfer not getting an ACK.
Delay an additional 1 ms per conversion to allow for additional margin.
Fixes: 4839367d99e3 ("iio: humidity: add HDC100x support") Signed-off-by: Chris Lesiak chris.lesiak@licor.com Acked-by: Matt Ranostay matt.ranostay@konsulko.com Link: https://lore.kernel.org/r/20210614141820.2034827-1-chris.lesiak@licor.com Cc: Stable@vger.kernel.org Signed-off-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/iio/humidity/hdc100x.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
--- a/drivers/iio/humidity/hdc100x.c +++ b/drivers/iio/humidity/hdc100x.c @@ -24,6 +24,8 @@ #include <linux/iio/trigger_consumer.h> #include <linux/iio/triggered_buffer.h>
+#include <linux/time.h> + #define HDC100X_REG_TEMP 0x00 #define HDC100X_REG_HUMIDITY 0x01
@@ -165,7 +167,7 @@ static int hdc100x_get_measurement(struc struct iio_chan_spec const *chan) { struct i2c_client *client = data->client; - int delay = data->adc_int_us[chan->address]; + int delay = data->adc_int_us[chan->address] + 1*USEC_PER_MSEC; int ret; __be16 val;
@@ -322,7 +324,7 @@ static irqreturn_t hdc100x_trigger_handl struct iio_dev *indio_dev = pf->indio_dev; struct hdc100x_data *data = iio_priv(indio_dev); struct i2c_client *client = data->client; - int delay = data->adc_int_us[0] + data->adc_int_us[1]; + int delay = data->adc_int_us[0] + data->adc_int_us[1] + 2*USEC_PER_MSEC; int ret;
/* dual read starts at temp register */
From: Colin Ian King colin.king@canonical.com
commit 5afc1540f13804a31bb704b763308e17688369c5 upstream.
Currently the for-loop that scans for the optimial adc_period iterates through all the possible adc_period levels because the exit logic in the loop is inverted. I believe the comparison should be swapped and the continue replaced with a break to exit the loop at the correct point.
Addresses-Coverity: ("Continue has no effect") Fixes: e08e19c331fb ("iio:adc: add iio driver for Palmas (twl6035/7) gpadc") Signed-off-by: Colin Ian King colin.king@canonical.com Link: https://lore.kernel.org/r/20210730071651.17394-1-colin.king@canonical.com Cc: stable@vger.kernel.org Signed-off-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/iio/adc/palmas_gpadc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/iio/adc/palmas_gpadc.c +++ b/drivers/iio/adc/palmas_gpadc.c @@ -656,8 +656,8 @@ static int palmas_adc_wakeup_configure(s
adc_period = adc->auto_conversion_period; for (i = 0; i < 16; ++i) { - if (((1000 * (1 << i)) / 32) < adc_period) - continue; + if (((1000 * (1 << i)) / 32) >= adc_period) + break; } if (i > 0) i--;
From: Takashi Iwai tiwai@suse.de
commit 42bc62c9f1d3d4880bdc27acb5ab4784209bb0b0 upstream.
PCM buffers might be allocated dynamically when the buffer preallocation failed or a larger buffer is requested, and it's not guaranteed that substream->dma_buffer points to the actually used buffer. The driver needs to refer to substream->runtime->dma_addr instead for the buffer address.
Cc: stable@vger.kernel.org Signed-off-by: Takashi Iwai tiwai@suse.de Link: https://lore.kernel.org/r/20210728112353.6675-4-tiwai@suse.de Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/soc/xilinx/xlnx_formatter_pcm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/sound/soc/xilinx/xlnx_formatter_pcm.c +++ b/sound/soc/xilinx/xlnx_formatter_pcm.c @@ -461,8 +461,8 @@ static int xlnx_formatter_pcm_hw_params(
stream_data->buffer_size = size;
- low = lower_32_bits(substream->dma_buffer.addr); - high = upper_32_bits(substream->dma_buffer.addr); + low = lower_32_bits(runtime->dma_addr); + high = upper_32_bits(runtime->dma_addr); writel(low, stream_data->mmio + XLNX_AUD_BUFF_ADDR_LSB); writel(high, stream_data->mmio + XLNX_AUD_BUFF_ADDR_MSB);
From: Takashi Iwai tiwai@suse.de
commit 2e6b836312a477d647a7920b56810a5a25f6c856 upstream.
PCM buffers might be allocated dynamically when the buffer preallocation failed or a larger buffer is requested, and it's not guaranteed that substream->dma_buffer points to the actually used buffer. The address should be retrieved from runtime->dma_addr, instead of substream->dma_buffer (and shouldn't use virt_to_phys).
Also, remove the line overriding runtime->dma_area superfluously, which was already set up at the PCM buffer allocation.
Cc: Cezary Rojewski cezary.rojewski@intel.com Cc: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com Cc: stable@vger.kernel.org Signed-off-by: Takashi Iwai tiwai@suse.de Link: https://lore.kernel.org/r/20210728112353.6675-3-tiwai@suse.de Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/soc/intel/atom/sst-mfld-platform-pcm.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/sound/soc/intel/atom/sst-mfld-platform-pcm.c +++ b/sound/soc/intel/atom/sst-mfld-platform-pcm.c @@ -127,7 +127,7 @@ static void sst_fill_alloc_params(struct snd_pcm_uframes_t period_size; ssize_t periodbytes; ssize_t buffer_bytes = snd_pcm_lib_buffer_bytes(substream); - u32 buffer_addr = virt_to_phys(substream->dma_buffer.area); + u32 buffer_addr = substream->runtime->dma_addr;
channels = substream->runtime->channels; period_size = substream->runtime->period_size; @@ -233,7 +233,6 @@ static int sst_platform_alloc_stream(str /* set codec params and inform SST driver the same */ sst_fill_pcm_params(substream, ¶m); sst_fill_alloc_params(substream, &alloc_params); - substream->runtime->dma_area = substream->dma_buffer.area; str_params.sparams = param; str_params.aparams = alloc_params; str_params.codec = SST_CODEC_TYPE_PCM;
From: Greg Kroah-Hartman gregkh@linuxfoundation.org
commit 86ff25ed6cd8240d18df58930bd8848b19fce308 upstream.
If an i2c driver happens to not provide the full amount of data that a user asks for, it is possible that some uninitialized data could be sent to userspace. While all in-kernel drivers look to be safe, just be sure by initializing the buffer to zero before it is passed to the i2c driver so that any future drivers will not have this issue.
Also properly copy the amount of data recvieved to the userspace buffer, as pointed out by Dan Carpenter.
Reported-by: Eric Dumazet edumazet@google.com Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Wolfram Sang wsa@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/i2c/i2c-dev.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/drivers/i2c/i2c-dev.c +++ b/drivers/i2c/i2c-dev.c @@ -141,7 +141,7 @@ static ssize_t i2cdev_read(struct file * if (count > 8192) count = 8192;
- tmp = kmalloc(count, GFP_KERNEL); + tmp = kzalloc(count, GFP_KERNEL); if (tmp == NULL) return -ENOMEM;
@@ -150,7 +150,8 @@ static ssize_t i2cdev_read(struct file *
ret = i2c_master_recv(client, tmp, count); if (ret >= 0) - ret = copy_to_user(buf, tmp, count) ? -EFAULT : ret; + if (copy_to_user(buf, tmp, ret)) + ret = -EFAULT; kfree(tmp); return ret; }
From: Luis Henriques lhenriques@suse.de
commit bf2ba432213fade50dd39f2e348085b758c0726e upstream.
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work workqueue and it can be kept looping for quite some time if caps keep being added back to the mdsc->cap_delay_list. This may result in the watchdog tainting the kernel with the softlockup flag.
This patch breaks this loop if the caps have been recently (i.e. during the loop execution). Any new caps added to the list will be handled in the next run.
Also, allow schedule_delayed() callers to explicitly set the delay value instead of defaulting to 5s, so we can ensure that it runs soon afterward if it looks like there is more work.
Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46284 Signed-off-by: Luis Henriques lhenriques@suse.de Reviewed-by: Jeff Layton jlayton@kernel.org Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ceph/caps.c | 17 ++++++++++++++++- fs/ceph/mds_client.c | 25 ++++++++++++++++--------- fs/ceph/super.h | 2 +- 3 files changed, 33 insertions(+), 11 deletions(-)
--- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -4053,12 +4053,20 @@ bad:
/* * Delayed work handler to process end of delayed cap release LRU list. + * + * If new caps are added to the list while processing it, these won't get + * processed in this run. In this case, the ci->i_hold_caps_max will be + * returned so that the work can be scheduled accordingly. */ -void ceph_check_delayed_caps(struct ceph_mds_client *mdsc) +unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc) { struct inode *inode; struct ceph_inode_info *ci; int flags = CHECK_CAPS_NODELAY; + struct ceph_mount_options *opt = mdsc->fsc->mount_options; + unsigned long delay_max = opt->caps_wanted_delay_max * HZ; + unsigned long loop_start = jiffies; + unsigned long delay = 0;
dout("check_delayed_caps\n"); while (1) { @@ -4068,6 +4076,11 @@ void ceph_check_delayed_caps(struct ceph ci = list_first_entry(&mdsc->cap_delay_list, struct ceph_inode_info, i_cap_delay_list); + if (time_before(loop_start, ci->i_hold_caps_max - delay_max)) { + dout("%s caps added recently. Exiting loop", __func__); + delay = ci->i_hold_caps_max; + break; + } if ((ci->i_ceph_flags & CEPH_I_FLUSH) == 0 && time_before(jiffies, ci->i_hold_caps_max)) break; @@ -4084,6 +4097,8 @@ void ceph_check_delayed_caps(struct ceph } } spin_unlock(&mdsc->cap_delay_lock); + + return delay; }
/* --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -4049,22 +4049,29 @@ static void maybe_recover_session(struct }
/* - * delayed work -- periodically trim expired leases, renew caps with mds + * delayed work -- periodically trim expired leases, renew caps with mds. If + * the @delay parameter is set to 0 or if it's more than 5 secs, the default + * workqueue delay value of 5 secs will be used. */ -static void schedule_delayed(struct ceph_mds_client *mdsc) +static void schedule_delayed(struct ceph_mds_client *mdsc, unsigned long delay) { - int delay = 5; - unsigned hz = round_jiffies_relative(HZ * delay); - schedule_delayed_work(&mdsc->delayed_work, hz); + unsigned long max_delay = HZ * 5; + + /* 5 secs default delay */ + if (!delay || (delay > max_delay)) + delay = max_delay; + schedule_delayed_work(&mdsc->delayed_work, + round_jiffies_relative(delay)); }
static void delayed_work(struct work_struct *work) { - int i; struct ceph_mds_client *mdsc = container_of(work, struct ceph_mds_client, delayed_work.work); + unsigned long delay; int renew_interval; int renew_caps; + int i;
dout("mdsc delayed_work\n");
@@ -4119,7 +4126,7 @@ static void delayed_work(struct work_str } mutex_unlock(&mdsc->mutex);
- ceph_check_delayed_caps(mdsc); + delay = ceph_check_delayed_caps(mdsc);
ceph_queue_cap_reclaim_work(mdsc);
@@ -4127,7 +4134,7 @@ static void delayed_work(struct work_str
maybe_recover_session(mdsc);
- schedule_delayed(mdsc); + schedule_delayed(mdsc, delay); }
int ceph_mdsc_init(struct ceph_fs_client *fsc) @@ -4600,7 +4607,7 @@ void ceph_mdsc_handle_mdsmap(struct ceph mdsc->mdsmap->m_epoch);
mutex_unlock(&mdsc->mutex); - schedule_delayed(mdsc); + schedule_delayed(mdsc, 0); return;
bad_unlock: --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -1064,7 +1064,7 @@ extern void ceph_flush_snaps(struct ceph extern bool __ceph_should_report_size(struct ceph_inode_info *ci); extern void ceph_check_caps(struct ceph_inode_info *ci, int flags, struct ceph_mds_session *session); -extern void ceph_check_delayed_caps(struct ceph_mds_client *mdsc); +extern unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc); extern void ceph_flush_dirty_caps(struct ceph_mds_client *mdsc); extern int ceph_drop_caps_for_unlink(struct inode *inode); extern int ceph_encode_inode_release(void **p, struct inode *inode,
From: Dan Williams dan.j.williams@intel.com
commit b93dfa6bda4d4e88e5386490f2b277a26958f9d3 upstream.
Fix the NFIT parsing code to treat a 0 index in a SPA Range Structure as a special case and not match Region Mapping Structures that use 0 to indicate that they are not mapped. Without this fix some platform BIOS descriptions of "virtual disk" ranges do not result in the pmem driver attaching to the range.
Details: In addition to typical persistent memory ranges, the ACPI NFIT may also convey "virtual" ranges. These ranges are indicated by a UUID in the SPA Range Structure of UUID_VOLATILE_VIRTUAL_DISK, UUID_VOLATILE_VIRTUAL_CD, UUID_PERSISTENT_VIRTUAL_DISK, or UUID_PERSISTENT_VIRTUAL_CD. The critical difference between virtual ranges and UUID_PERSISTENT_MEMORY, is that virtual do not support associations with Region Mapping Structures. For this reason the "index" value of virtual SPA Range Structures is allowed to be 0. If a platform BIOS decides to represent NVDIMMs with disconnected "Region Mapping Structures" (range-index == 0), the kernel may falsely associate them with standalone ranges where the "SPA Range Structure Index" is also zero. When this happens the driver may falsely require labels where "virtual disks" are expected to be label-less. I.e. "label-less" is where the namespace-range == region-range and the pmem driver attaches with no user action to create a namespace.
Cc: Jacek Zloch jacek.zloch@intel.com Cc: Lukasz Sobieraj lukasz.sobieraj@intel.com Cc: "Lee, Chun-Yi" jlee@suse.com Cc: stable@vger.kernel.org Fixes: c2f32acdf848 ("acpi, nfit: treat virtual ramdisk SPA as pmem region") Reported-by: Krzysztof Rusocki krzysztof.rusocki@intel.com Reported-by: Damian Bassa damian.bassa@intel.com Reviewed-by: Jeff Moyer jmoyer@redhat.com Link: https://lore.kernel.org/r/162870796589.2521182.1240403310175570220.stgit@dwi... Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/acpi/nfit/core.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2973,6 +2973,9 @@ static int acpi_nfit_register_region(str struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev; struct nd_mapping_desc *mapping;
+ /* range index 0 == unmapped in SPA or invalid-SPA */ + if (memdev->range_index == 0 || spa->range_index == 0) + continue; if (memdev->range_index != spa->range_index) continue; if (count >= ND_MAX_MAPPINGS) {
From: Dan Williams dan.j.williams@intel.com
commit d9cee9f85b22fab88d2b76d2e92b18e3d0e6aa8c upstream.
There are a few scenarios where init_active_labels() can return without registering deactivate_labels() to run when the region is disabled. In particular label error injection creates scenarios where a DIMM is disabled, but labels on other DIMMs in the region become activated.
Arrange for init_active_labels() to always register deactivate_labels().
Reported-by: Krzysztof Kensicki krzysztof.kensicki@intel.com Cc: stable@vger.kernel.org Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation.") Reviewed-by: Jeff Moyer jmoyer@redhat.com Link: https://lore.kernel.org/r/162766356450.3223041.1183118139023841447.stgit@dwi... Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/nvdimm/namespace_devs.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-)
--- a/drivers/nvdimm/namespace_devs.c +++ b/drivers/nvdimm/namespace_devs.c @@ -2486,7 +2486,7 @@ static void deactivate_labels(void *regi
static int init_active_labels(struct nd_region *nd_region) { - int i; + int i, rc = 0;
for (i = 0; i < nd_region->ndr_mappings; i++) { struct nd_mapping *nd_mapping = &nd_region->mapping[i]; @@ -2505,13 +2505,14 @@ static int init_active_labels(struct nd_ else if (test_bit(NDD_ALIASING, &nvdimm->flags)) /* fail, labels needed to disambiguate dpa */; else - return 0; + continue;
dev_err(&nd_region->dev, "%s: is %s, failing probe\n", dev_name(&nd_mapping->nvdimm->dev), test_bit(NDD_LOCKED, &nvdimm->flags) ? "locked" : "disabled"); - return -ENXIO; + rc = -ENXIO; + goto out; } nd_mapping->ndd = ndd; atomic_inc(&nvdimm->busy); @@ -2545,13 +2546,17 @@ static int init_active_labels(struct nd_ break; }
- if (i < nd_region->ndr_mappings) { + if (i < nd_region->ndr_mappings) + rc = -ENOMEM; + +out: + if (rc) { deactivate_labels(nd_region); - return -ENOMEM; + return rc; }
return devm_add_action_or_reset(&nd_region->dev, deactivate_labels, - nd_region); + nd_region); }
int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
From: Dongliang Mu mudongliangabcd@gmail.com
[ Upstream commit e9faf53c5a5d01f6f2a09ae28ec63a3bbd6f64fd ]
Both MAC802154_HWSIM_ATTR_RADIO_ID and MAC802154_HWSIM_ATTR_RADIO_EDGE, MAC802154_HWSIM_EDGE_ATTR_ENDPOINT_ID and MAC802154_HWSIM_EDGE_ATTR_LQI must be present to fix GPF.
Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb") Signed-off-by: Dongliang Mu mudongliangabcd@gmail.com Acked-by: Alexander Aring aahringo@redhat.com Link: https://lore.kernel.org/r/20210705131321.217111-1-mudongliangabcd@gmail.com Signed-off-by: Stefan Schmidt stefan@datenfreihafen.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ieee802154/mac802154_hwsim.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ieee802154/mac802154_hwsim.c b/drivers/net/ieee802154/mac802154_hwsim.c index 79d74763cf24..da7f8cc9b181 100644 --- a/drivers/net/ieee802154/mac802154_hwsim.c +++ b/drivers/net/ieee802154/mac802154_hwsim.c @@ -528,14 +528,14 @@ static int hwsim_set_edge_lqi(struct sk_buff *msg, struct genl_info *info) u32 v0, v1; u8 lqi;
- if (!info->attrs[MAC802154_HWSIM_ATTR_RADIO_ID] && + if (!info->attrs[MAC802154_HWSIM_ATTR_RADIO_ID] || !info->attrs[MAC802154_HWSIM_ATTR_RADIO_EDGE]) return -EINVAL;
if (nla_parse_nested_deprecated(edge_attrs, MAC802154_HWSIM_EDGE_ATTR_MAX, info->attrs[MAC802154_HWSIM_ATTR_RADIO_EDGE], hwsim_edge_policy, NULL)) return -EINVAL;
- if (!edge_attrs[MAC802154_HWSIM_EDGE_ATTR_ENDPOINT_ID] && + if (!edge_attrs[MAC802154_HWSIM_EDGE_ATTR_ENDPOINT_ID] || !edge_attrs[MAC802154_HWSIM_EDGE_ATTR_LQI]) return -EINVAL;
From: Dongliang Mu mudongliangabcd@gmail.com
[ Upstream commit 889d0e7dc68314a273627d89cbb60c09e1cc1c25 ]
Both MAC802154_HWSIM_ATTR_RADIO_ID and MAC802154_HWSIM_ATTR_RADIO_EDGE must be present to fix GPF.
Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb") Signed-off-by: Dongliang Mu mudongliangabcd@gmail.com Acked-by: Alexander Aring aahringo@redhat.com Link: https://lore.kernel.org/r/20210707155633.1486603-1-mudongliangabcd@gmail.com Signed-off-by: Stefan Schmidt stefan@datenfreihafen.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ieee802154/mac802154_hwsim.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ieee802154/mac802154_hwsim.c b/drivers/net/ieee802154/mac802154_hwsim.c index da7f8cc9b181..2a78084aeaed 100644 --- a/drivers/net/ieee802154/mac802154_hwsim.c +++ b/drivers/net/ieee802154/mac802154_hwsim.c @@ -418,7 +418,7 @@ static int hwsim_new_edge_nl(struct sk_buff *msg, struct genl_info *info) struct hwsim_edge *e; u32 v0, v1;
- if (!info->attrs[MAC802154_HWSIM_ATTR_RADIO_ID] && + if (!info->attrs[MAC802154_HWSIM_ATTR_RADIO_ID] || !info->attrs[MAC802154_HWSIM_ATTR_RADIO_EDGE]) return -EINVAL;
From: Richard Fitzgerald rf@opensource.cirrus.com
[ Upstream commit ee86f680ff4c9b406d49d4e22ddf10805b8a2137 ]
The ADC volume is a signed 8-bit number with range -97 to +12, with -97 being mute. Use a SOC_SINGLE_S8_TLV() to define this and fix the DECLARE_TLV_DB_SCALE() to have the correct start and mute flag.
Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com Link: https://lore.kernel.org/r/20210729170929.6589-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/codecs/cs42l42.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/sound/soc/codecs/cs42l42.c b/sound/soc/codecs/cs42l42.c index 5faf8877137a..5c6d288a92ab 100644 --- a/sound/soc/codecs/cs42l42.c +++ b/sound/soc/codecs/cs42l42.c @@ -403,7 +403,7 @@ static const struct regmap_config cs42l42_regmap = { .use_single_write = true, };
-static DECLARE_TLV_DB_SCALE(adc_tlv, -9600, 100, false); +static DECLARE_TLV_DB_SCALE(adc_tlv, -9700, 100, true); static DECLARE_TLV_DB_SCALE(mixer_tlv, -6300, 100, true);
static const char * const cs42l42_hpf_freq_text[] = { @@ -442,8 +442,7 @@ static const struct snd_kcontrol_new cs42l42_snd_controls[] = { CS42L42_ADC_INV_SHIFT, true, false), SOC_SINGLE("ADC Boost Switch", CS42L42_ADC_CTL, CS42L42_ADC_DIG_BOOST_SHIFT, true, false), - SOC_SINGLE_SX_TLV("ADC Volume", CS42L42_ADC_VOLUME, - CS42L42_ADC_VOL_SHIFT, 0xA0, 0x6C, adc_tlv), + SOC_SINGLE_S8_TLV("ADC Volume", CS42L42_ADC_VOLUME, -97, 12, adc_tlv), SOC_SINGLE("ADC WNF Switch", CS42L42_ADC_WNF_HPF_CTL, CS42L42_ADC_WNF_EN_SHIFT, true, false), SOC_SINGLE("ADC HPF Switch", CS42L42_ADC_WNF_HPF_CTL,
From: Richard Fitzgerald rf@opensource.cirrus.com
[ Upstream commit 64324bac750b84ca54711fb7d332132fcdb87293 ]
The driver has no support for left-justified protocol so it should not have been allowing this to be passed to cs42l42_set_dai_fmt().
Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Link: https://lore.kernel.org/r/20210729170929.6589-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/codecs/cs42l42.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/sound/soc/codecs/cs42l42.c b/sound/soc/codecs/cs42l42.c index 5c6d288a92ab..978f5df1ff79 100644 --- a/sound/soc/codecs/cs42l42.c +++ b/sound/soc/codecs/cs42l42.c @@ -772,7 +772,6 @@ static int cs42l42_set_dai_fmt(struct snd_soc_dai *codec_dai, unsigned int fmt) /* interface format */ switch (fmt & SND_SOC_DAIFMT_FORMAT_MASK) { case SND_SOC_DAIFMT_I2S: - case SND_SOC_DAIFMT_LEFT_J: break; default: return -EINVAL;
From: Richard Fitzgerald rf@opensource.cirrus.com
[ Upstream commit 30615bd21b4cc3c3bb5ae8bd70e2a915cc5f75c7 ]
The underlying register field has inverted sense (0 = enabled) so the control definition must be marked as inverted.
Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Link: https://lore.kernel.org/r/20210803160834.9005-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/codecs/cs42l42.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sound/soc/codecs/cs42l42.c b/sound/soc/codecs/cs42l42.c index 978f5df1ff79..032ec4bd060b 100644 --- a/sound/soc/codecs/cs42l42.c +++ b/sound/soc/codecs/cs42l42.c @@ -435,7 +435,7 @@ static SOC_ENUM_SINGLE_DECL(cs42l42_wnf05_freq_enum, CS42L42_ADC_WNF_HPF_CTL, static const struct snd_kcontrol_new cs42l42_snd_controls[] = { /* ADC Volume and Filter Controls */ SOC_SINGLE("ADC Notch Switch", CS42L42_ADC_CTL, - CS42L42_ADC_NOTCH_DIS_SHIFT, true, false), + CS42L42_ADC_NOTCH_DIS_SHIFT, true, true), SOC_SINGLE("ADC Weak Force Switch", CS42L42_ADC_CTL, CS42L42_ADC_FORCE_WEAK_VCM_SHIFT, true, false), SOC_SINGLE("ADC Invert Switch", CS42L42_ADC_CTL,
From: Richard Fitzgerald rf@opensource.cirrus.com
[ Upstream commit 8b353bbeae20e2214c9d9d88bcb2fda4ba145d83 ]
The driver was defining two ALSA controls that both change the same register field for the wind noise filter corner frequency. The filter response has two corners, at different frequencies, and the duplicate controls most likely were an attempt to be able to set the value using either of the frequencies.
However, having two controls changing the same field can be problematic and it is unnecessary. Both frequencies are related to each other so setting one implies exactly what the other would be.
Removing a control affects user-side code, but there is currently no known use of the removed control so it would be best to remove it now before it becomes a problem.
Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Link: https://lore.kernel.org/r/20210803160834.9005-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/codecs/cs42l42.c | 10 ---------- 1 file changed, 10 deletions(-)
diff --git a/sound/soc/codecs/cs42l42.c b/sound/soc/codecs/cs42l42.c index 032ec4bd060b..eb37f88f8233 100644 --- a/sound/soc/codecs/cs42l42.c +++ b/sound/soc/codecs/cs42l42.c @@ -423,15 +423,6 @@ static SOC_ENUM_SINGLE_DECL(cs42l42_wnf3_freq_enum, CS42L42_ADC_WNF_HPF_CTL, CS42L42_ADC_WNF_CF_SHIFT, cs42l42_wnf3_freq_text);
-static const char * const cs42l42_wnf05_freq_text[] = { - "280Hz", "315Hz", "350Hz", "385Hz", - "420Hz", "455Hz", "490Hz", "525Hz" -}; - -static SOC_ENUM_SINGLE_DECL(cs42l42_wnf05_freq_enum, CS42L42_ADC_WNF_HPF_CTL, - CS42L42_ADC_WNF_CF_SHIFT, - cs42l42_wnf05_freq_text); - static const struct snd_kcontrol_new cs42l42_snd_controls[] = { /* ADC Volume and Filter Controls */ SOC_SINGLE("ADC Notch Switch", CS42L42_ADC_CTL, @@ -449,7 +440,6 @@ static const struct snd_kcontrol_new cs42l42_snd_controls[] = { CS42L42_ADC_HPF_EN_SHIFT, true, false), SOC_ENUM("HPF Corner Freq", cs42l42_hpf_freq_enum), SOC_ENUM("WNF 3dB Freq", cs42l42_wnf3_freq_enum), - SOC_ENUM("WNF 05dB Freq", cs42l42_wnf05_freq_enum),
/* DAC Volume and Filter Controls */ SOC_SINGLE("DACA Invert Switch", CS42L42_DAC_CTL1,
From: Yajun Deng yajun.deng@linux.dev
[ Upstream commit 38ea9def5b62f9193f6bad96c5d108e2830ecbde ]
It should be added kfree_skb_list() when err is not equal to zero in nf_br_ip_fragment().
v2: keep this aligned with IPv6. v3: modify iter.frag_list to iter.frag.
Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Yajun Deng yajun.deng@linux.dev Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/bridge/netfilter/nf_conntrack_bridge.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c index 8d033a75a766..fdbed3158555 100644 --- a/net/bridge/netfilter/nf_conntrack_bridge.c +++ b/net/bridge/netfilter/nf_conntrack_bridge.c @@ -88,6 +88,12 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
skb = ip_fraglist_next(&iter); } + + if (!err) + return 0; + + kfree_skb_list(iter.frag); + return err; } slow_path:
From: Richard Fitzgerald rf@opensource.cirrus.com
[ Upstream commit 0c2f2ad4f16a58879463d0979a54293f8f296d6f ]
An I2S frame starts on the falling edge of LRCLK so ASP_STP must be 0.
At the same time, move other format settings in the same register from cs42l42_pll_config() to cs42l42_set_dai_fmt() where you'd expect to find them, and merge into a single write.
Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Link: https://lore.kernel.org/r/20210805161111.10410-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/codecs/cs42l42.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/sound/soc/codecs/cs42l42.c b/sound/soc/codecs/cs42l42.c index eb37f88f8233..6825e874785f 100644 --- a/sound/soc/codecs/cs42l42.c +++ b/sound/soc/codecs/cs42l42.c @@ -658,15 +658,6 @@ static int cs42l42_pll_config(struct snd_soc_component *component) CS42L42_FSYNC_PULSE_WIDTH_MASK, CS42L42_FRAC1_VAL(fsync - 1) << CS42L42_FSYNC_PULSE_WIDTH_SHIFT); - snd_soc_component_update_bits(component, - CS42L42_ASP_FRM_CFG, - CS42L42_ASP_5050_MASK, - CS42L42_ASP_5050_MASK); - /* Set the frame delay to 1.0 SCLK clocks */ - snd_soc_component_update_bits(component, CS42L42_ASP_FRM_CFG, - CS42L42_ASP_FSD_MASK, - CS42L42_ASP_FSD_1_0 << - CS42L42_ASP_FSD_SHIFT); /* Set the sample rates (96k or lower) */ snd_soc_component_update_bits(component, CS42L42_FS_RATE_EN, CS42L42_FS_EN_MASK, @@ -762,6 +753,18 @@ static int cs42l42_set_dai_fmt(struct snd_soc_dai *codec_dai, unsigned int fmt) /* interface format */ switch (fmt & SND_SOC_DAIFMT_FORMAT_MASK) { case SND_SOC_DAIFMT_I2S: + /* + * 5050 mode, frame starts on falling edge of LRCLK, + * frame delayed by 1.0 SCLKs + */ + snd_soc_component_update_bits(component, + CS42L42_ASP_FRM_CFG, + CS42L42_ASP_STP_MASK | + CS42L42_ASP_5050_MASK | + CS42L42_ASP_FSD_MASK, + CS42L42_ASP_5050_MASK | + (CS42L42_ASP_FSD_1_0 << + CS42L42_ASP_FSD_SHIFT)); break; default: return -EINVAL;
From: DENG Qingfang dqfext@gmail.com
[ Upstream commit aff51c5da3208bd164381e1488998667269c6cf4 ]
Add the missing RxUnicast counter.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: DENG Qingfang dqfext@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/mt7530.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c index 071e5015bf91..e1a3c33fdad9 100644 --- a/drivers/net/dsa/mt7530.c +++ b/drivers/net/dsa/mt7530.c @@ -45,6 +45,7 @@ static const struct mt7530_mib_desc mt7530_mib[] = { MIB_DESC(2, 0x48, "TxBytes"), MIB_DESC(1, 0x60, "RxDrop"), MIB_DESC(1, 0x64, "RxFiltering"), + MIB_DESC(1, 0x68, "RxUnicast"), MIB_DESC(1, 0x6c, "RxMulticast"), MIB_DESC(1, 0x70, "RxBroadcast"), MIB_DESC(1, 0x74, "RxAlignErr"),
From: Florian Eckert fe@dev.tdt.de
[ Upstream commit f560cd502190a9fd0ca8db0a15c5cca7d9091d2c ]
This reverts commit 5037d4ddda31c2dbbb018109655f61054b1756dc.
Explanation why this does not work: This change connects the simswap to the LED subsystem of the kernel.
From my point of view, it's nonsense. If we do it this way, then this
can be switched relatively easily via the LED subsystem (trigger: none/default-on) and that is dangerous! If this is used, it would be unfavorable, since there is also another trigger (trigger: heartbeat/netdev).
Therefore, this simswap GPIO should remain in the GPIO subsystem and be switched via it and not be connected to the LED subsystem. To avoid the problems mentioned above. The LED subsystem is not made for this and it is not a good compromise, but rather dangerous.
Signed-off-by: Florian Eckert fe@dev.tdt.de Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/platform/x86/pcengines-apuv2.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/drivers/platform/x86/pcengines-apuv2.c b/drivers/platform/x86/pcengines-apuv2.c index c32daf087640..cb95b4ede824 100644 --- a/drivers/platform/x86/pcengines-apuv2.c +++ b/drivers/platform/x86/pcengines-apuv2.c @@ -78,7 +78,6 @@ static const struct gpio_led apu2_leds[] = { { .name = "apu:green:1" }, { .name = "apu:green:2" }, { .name = "apu:green:3" }, - { .name = "apu:simswap" }, };
static const struct gpio_led_platform_data apu2_leds_pdata = { @@ -95,8 +94,6 @@ static struct gpiod_lookup_table gpios_led_table = { NULL, 1, GPIO_ACTIVE_LOW), GPIO_LOOKUP_IDX(AMD_FCH_GPIO_DRIVER_NAME, APU2_GPIO_LINE_LED3, NULL, 2, GPIO_ACTIVE_LOW), - GPIO_LOOKUP_IDX(AMD_FCH_GPIO_DRIVER_NAME, APU2_GPIO_LINE_SIMSWAP, - NULL, 3, GPIO_ACTIVE_LOW), } };
From: Hans de Goede hdegoede@redhat.com
[ Upstream commit 9d7b132e62e41b7d49bf157aeaf9147c27492e0f ]
The gpiod_lookup_table.table passed to gpiod_add_lookup_table() must be terminated with an empty entry, add this.
Note we have likely been getting away with this not being present because the GPIO lookup code first matches on the dev_id, causing most lookups to skip checking the table and the lookups which do check the table will find a matching entry before reaching the end. With that said, terminating these tables properly still is obviously the correct thing to do.
Fixes: f8eb0235f659 ("x86: pcengines apuv2 gpio/leds/keys platform driver") Signed-off-by: Hans de Goede hdegoede@redhat.com Link: https://lore.kernel.org/r/20210806115515.12184-1-hdegoede@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/platform/x86/pcengines-apuv2.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/platform/x86/pcengines-apuv2.c b/drivers/platform/x86/pcengines-apuv2.c index cb95b4ede824..5db6f7394ef2 100644 --- a/drivers/platform/x86/pcengines-apuv2.c +++ b/drivers/platform/x86/pcengines-apuv2.c @@ -94,6 +94,7 @@ static struct gpiod_lookup_table gpios_led_table = { NULL, 1, GPIO_ACTIVE_LOW), GPIO_LOOKUP_IDX(AMD_FCH_GPIO_DRIVER_NAME, APU2_GPIO_LINE_LED3, NULL, 2, GPIO_ACTIVE_LOW), + {} /* Terminating entry */ } };
@@ -123,6 +124,7 @@ static struct gpiod_lookup_table gpios_key_table = { .table = { GPIO_LOOKUP_IDX(AMD_FCH_GPIO_DRIVER_NAME, APU2_GPIO_LINE_MODESW, NULL, 0, GPIO_ACTIVE_LOW), + {} /* Terminating entry */ } };
From: Ben Hutchings ben.hutchings@mind.be
[ Upstream commit 2383cb9497d113360137a2be308b390faa80632d ]
Commit a5e63c7d38d5 "net: phy: micrel: Fix detection of ksz87xx switch" broke link detection on the external ports of the KSZ8795.
The previously unused phy_driver structure for these devices specifies config_aneg and read_status functions that appear to be designed for a fixed link and do not work with the embedded PHYs in the KSZ8795.
Delete the use of these functions in favour of the generic PHY implementations which were used previously.
Fixes: a5e63c7d38d5 ("net: phy: micrel: Fix detection of ksz87xx switch") Signed-off-by: Ben Hutchings ben.hutchings@mind.be Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/phy/micrel.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c index 910ab2182158..f95bd1b0fb96 100644 --- a/drivers/net/phy/micrel.c +++ b/drivers/net/phy/micrel.c @@ -1184,8 +1184,6 @@ static struct phy_driver ksphy_driver[] = { .name = "Micrel KSZ87XX Switch", /* PHY_BASIC_FEATURES */ .config_init = kszphy_config_init, - .config_aneg = ksz8873mll_config_aneg, - .read_status = ksz8873mll_read_status, .match_phy_device = ksz8795_match_phy_device, .suspend = genphy_suspend, .resume = genphy_resume,
From: Pali Rohár pali@kernel.org
[ Upstream commit 2459dcb96bcba94c08d6861f8a050185ff301672 ]
IFLA_IFNAME is nul-term string which means that IFLA_IFNAME buffer can be larger than length of string which contains.
Function __rtnl_newlink() generates new own ifname if either IFLA_IFNAME was not specified at all or userspace passed empty nul-term string.
It is expected that if userspace does not specify ifname for new ppp netdev then kernel generates one in format "ppp<id>" where id matches to the ppp unit id which can be later obtained by PPPIOCGUNIT ioctl.
And it works in this way if IFLA_IFNAME is not specified at all. But it does not work when IFLA_IFNAME is specified with empty string.
So fix this logic also for empty IFLA_IFNAME in ppp_nl_newlink() function and correctly generates ifname based on ppp unit identifier if userspace did not provided preferred ifname.
Without this patch when IFLA_IFNAME was specified with empty string then kernel created a new ppp interface in format "ppp<id>" but id did not match ppp unit id returned by PPPIOCGUNIT ioctl. In this case id was some number generated by __rtnl_newlink() function.
Signed-off-by: Pali Rohár pali@kernel.org Fixes: bb8082f69138 ("ppp: build ifname using unit identifier for rtnl based devices") Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ppp/ppp_generic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c index b7e2b4a0f3c6..c6c41a7836c9 100644 --- a/drivers/net/ppp/ppp_generic.c +++ b/drivers/net/ppp/ppp_generic.c @@ -1121,7 +1121,7 @@ static int ppp_nl_newlink(struct net *src_net, struct net_device *dev, * the PPP unit identifer as suffix (i.e. ppp<unit_id>). This allows * userspace to infer the device name using to the PPPIOCGUNIT ioctl. */ - if (!tb[IFLA_IFNAME]) + if (!tb[IFLA_IFNAME] || !nla_len(tb[IFLA_IFNAME]) || !*(char *)nla_data(tb[IFLA_IFNAME])) conf.ifname_is_set = false;
err = ppp_dev_configure(src_net, dev, &conf);
From: Hangbin Liu liuhangbin@gmail.com
[ Upstream commit d09c548dbf3b31cb07bba562e0f452edfa01efe3 ]
When mirror/redirect a skb to a different port, the ct info should be reset for reclassification. Or the pkts will match unexpected rules. For example, with following topology and commands:
----------- | veth0 -+------- | veth1 -+------- | ------------
tc qdisc add dev veth0 clsact # The same with "action mirred egress mirror dev veth1" or "action mirred ingress redirect dev veth1" tc filter add dev veth0 egress chain 1 protocol ip flower ct_state +trk action mirred ingress mirror dev veth1 tc filter add dev veth0 egress chain 0 protocol ip flower ct_state -inv action ct commit action goto chain 1 tc qdisc add dev veth1 clsact tc filter add dev veth1 ingress chain 0 protocol ip flower ct_state +trk action drop
ping <remove ip via veth0> & tc -s filter show dev veth1 ingress
With command 'tc -s filter show', we can find the pkts were dropped on veth1.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Roi Dayan roid@nvidia.com Signed-off-by: Hangbin Liu liuhangbin@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/sched/act_mirred.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c index 8327ef9793ef..e3ff884a48c5 100644 --- a/net/sched/act_mirred.c +++ b/net/sched/act_mirred.c @@ -261,6 +261,9 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a, goto out; }
+ /* All mirred/redirected skbs should clear previous ct info */ + nf_reset_ct(skb2); + want_ingress = tcf_mirred_act_wants_ingress(m_eaction);
expects_nh = want_ingress || !m_mac_header_xmit;
From: Md Fahad Iqbal Polash md.fahad.iqbal.polash@intel.com
[ Upstream commit a7550f8b1c9712894f9e98d6caf5f49451ebd058 ]
iavf driver should set RSS LUT and key unconditionally in reset path. Currently, the driver does not do that. This patch fixes this issue.
Fixes: 2c86ac3c7079 ("i40evf: create a generic config RSS function") Signed-off-by: Md Fahad Iqbal Polash md.fahad.iqbal.polash@intel.com Tested-by: Konrad Jankowski konrad0.jankowski@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/intel/iavf/iavf_main.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index cda9b9a8392a..dc902e371c2c 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -1499,11 +1499,6 @@ static int iavf_reinit_interrupt_scheme(struct iavf_adapter *adapter) set_bit(__IAVF_VSI_DOWN, adapter->vsi.state);
iavf_map_rings_to_vectors(adapter); - - if (RSS_AQ(adapter)) - adapter->aq_required |= IAVF_FLAG_AQ_CONFIGURE_RSS; - else - err = iavf_init_rss(adapter); err: return err; } @@ -2179,6 +2174,14 @@ continue_reset: goto reset_err; }
+ if (RSS_AQ(adapter)) { + adapter->aq_required |= IAVF_FLAG_AQ_CONFIGURE_RSS; + } else { + err = iavf_init_rss(adapter); + if (err) + goto reset_err; + } + adapter->aq_required |= IAVF_FLAG_AQ_GET_CONFIG; adapter->aq_required |= IAVF_FLAG_AQ_MAP_VECTORS;
From: Roi Dayan roid@nvidia.com
[ Upstream commit beb7f2de5728b0bd2140a652fa51f6ad85d159f7 ]
Without this there is a warning if source files include psample.h before skbuff.h or doesn't include it at all.
Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling") Signed-off-by: Roi Dayan roid@nvidia.com Link: https://lore.kernel.org/r/20210808065242.1522535-1-roid@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/net/psample.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/include/net/psample.h b/include/net/psample.h index 68ae16bb0a4a..20a17551f790 100644 --- a/include/net/psample.h +++ b/include/net/psample.h @@ -18,6 +18,8 @@ struct psample_group *psample_group_get(struct net *net, u32 group_num); void psample_group_take(struct psample_group *group); void psample_group_put(struct psample_group *group);
+struct sk_buff; + #if IS_ENABLED(CONFIG_PSAMPLE)
void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
From: Aya Levin ayal@nvidia.com
[ Upstream commit bd37c2888ccaa5ceb9895718f6909b247cc372e0 ]
Check return value of mlx5_fw_tracer_start(), set error path and fix return value of mlx5_fw_tracer_init() accordingly.
Fixes: c71ad41ccb0c ("net/mlx5: FW tracer, events handling") Signed-off-by: Aya Levin ayal@nvidia.com Reviewed-by: Moshe Shemesh moshe@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- .../net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c index eb2e57ff08a6..dc36b0db3722 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c @@ -1017,12 +1017,19 @@ int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer) MLX5_NB_INIT(&tracer->nb, fw_tracer_event, DEVICE_TRACER); mlx5_eq_notifier_register(dev, &tracer->nb);
- mlx5_fw_tracer_start(tracer); - + err = mlx5_fw_tracer_start(tracer); + if (err) { + mlx5_core_warn(dev, "FWTracer: Failed to start tracer %d\n", err); + goto err_notifier_unregister; + } return 0;
+err_notifier_unregister: + mlx5_eq_notifier_unregister(dev, &tracer->nb); + mlx5_core_destroy_mkey(dev, &tracer->buff.mkey); err_dealloc_pd: mlx5_core_dealloc_pd(dev, tracer->buff.pdn); + cancel_work_sync(&tracer->read_fw_strings_work); return err; }
From: Christian Hewitt christianshewitt@gmail.com
[ Upstream commit bf33677a3c394bb8fddd48d3bbc97adf0262e045 ]
Add support for the OSD1 HDR registers so meson DRM can handle the HDR properties set by Amlogic u-boot on G12A and newer devices which result in blue/green/pink colour distortion to display output.
This takes the original patch submissions from Mathias [0] and [1] with corrections for formatting and the missing description and attribution needed for merge.
[0] https://lore.kernel.org/linux-amlogic/59dfd7e6-fc91-3d61-04c4-94e078a3188c@b... [1] https://lore.kernel.org/linux-amlogic/CAOKfEHBx_fboUqkENEMd-OC-NSrf46nto+vDL...
Fixes: 728883948b0d ("drm/meson: Add G12A Support for VIU setup") Suggested-by: Mathias Steiger mathias.steiger@googlemail.com Signed-off-by: Christian Hewitt christianshewitt@gmail.com Tested-by: Neil Armstrong narmstrong@baylibre.com Tested-by: Philip Milev milev.philip@gmail.com [narmsrong: adding missing space on second tested-by tag] Signed-off-by: Neil Armstrong narmstrong@baylibre.com Link: https://patchwork.freedesktop.org/patch/msgid/20210806094005.7136-1-christia... Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/meson/meson_registers.h | 5 +++++ drivers/gpu/drm/meson/meson_viu.c | 7 ++++++- 2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/meson/meson_registers.h b/drivers/gpu/drm/meson/meson_registers.h index 05fce48ceee0..f7da816a5562 100644 --- a/drivers/gpu/drm/meson/meson_registers.h +++ b/drivers/gpu/drm/meson/meson_registers.h @@ -590,6 +590,11 @@ #define VPP_WRAP_OSD3_MATRIX_PRE_OFFSET2 0x3dbc #define VPP_WRAP_OSD3_MATRIX_EN_CTRL 0x3dbd
+/* osd1 HDR */ +#define OSD1_HDR2_CTRL 0x38a0 +#define OSD1_HDR2_CTRL_VDIN0_HDR2_TOP_EN BIT(13) +#define OSD1_HDR2_CTRL_REG_ONLY_MAT BIT(16) + /* osd2 scaler */ #define OSD2_VSC_PHASE_STEP 0x3d00 #define OSD2_VSC_INI_PHASE 0x3d01 diff --git a/drivers/gpu/drm/meson/meson_viu.c b/drivers/gpu/drm/meson/meson_viu.c index 68cf2c2eca5f..33698814c022 100644 --- a/drivers/gpu/drm/meson/meson_viu.c +++ b/drivers/gpu/drm/meson/meson_viu.c @@ -356,9 +356,14 @@ void meson_viu_init(struct meson_drm *priv) if (meson_vpu_is_compatible(priv, VPU_COMPATIBLE_GXM) || meson_vpu_is_compatible(priv, VPU_COMPATIBLE_GXL)) meson_viu_load_matrix(priv); - else if (meson_vpu_is_compatible(priv, VPU_COMPATIBLE_G12A)) + else if (meson_vpu_is_compatible(priv, VPU_COMPATIBLE_G12A)) { meson_viu_set_g12a_osd1_matrix(priv, RGB709_to_YUV709l_coeff, true); + /* fix green/pink color distortion from vendor u-boot */ + writel_bits_relaxed(OSD1_HDR2_CTRL_REG_ONLY_MAT | + OSD1_HDR2_CTRL_VDIN0_HDR2_TOP_EN, 0, + priv->io_base + _REG(OSD1_HDR2_CTRL)); + }
/* Initialize OSD1 fifo control register */ reg = VIU_OSD_DDR_PRIORITY_URGENT |
From: Ben Hutchings ben.hutchings@mind.be
[ Upstream commit c34f674c8875235725c3ef86147a627f165d23b4 ]
ksz_read64() currently does some dubious byte-swapping on the two halves of a 64-bit register, and then only returns the high bits. Replace this with a straightforward expression.
Fixes: e66f840c08a2 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver") Signed-off-by: Ben Hutchings ben.hutchings@mind.be Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz_common.h | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz_common.h b/drivers/net/dsa/microchip/ksz_common.h index 061142b183cb..d6013410dc88 100644 --- a/drivers/net/dsa/microchip/ksz_common.h +++ b/drivers/net/dsa/microchip/ksz_common.h @@ -215,12 +215,8 @@ static inline int ksz_read64(struct ksz_device *dev, u32 reg, u64 *val) int ret;
ret = regmap_bulk_read(dev->regmap[2], reg, value, 2); - if (!ret) { - /* Ick! ToDo: Add 64bit R/W to regmap on 32bit systems */ - value[0] = swab32(value[0]); - value[1] = swab32(value[1]); - *val = swab64((u64)*value); - } + if (!ret) + *val = (u64)value[0] << 32 | value[1];
return ret; }
From: Takeshi Misawa jeliantsurux@gmail.com
[ Upstream commit 1090340f7ee53e824fd4eef66a4855d548110c5b ]
If IEEE-802.15.4-RAW is closed before receive skb, skb is leaked. Fix this, by freeing sk_receive_queue in sk->sk_destruct().
syzbot report: BUG: memory leak unreferenced object 0xffff88810f644600 (size 232): comm "softirq", pid 0, jiffies 4294967032 (age 81.270s) hex dump (first 32 bytes): 10 7d 4b 12 81 88 ff ff 10 7d 4b 12 81 88 ff ff .}K......}K..... 00 00 00 00 00 00 00 00 40 7c 4b 12 81 88 ff ff ........@|K..... backtrace: [<ffffffff83651d4a>] skb_clone+0xaa/0x2b0 net/core/skbuff.c:1496 [<ffffffff83fe1b80>] ieee802154_raw_deliver net/ieee802154/socket.c:369 [inline] [<ffffffff83fe1b80>] ieee802154_rcv+0x100/0x340 net/ieee802154/socket.c:1070 [<ffffffff8367cc7a>] __netif_receive_skb_one_core+0x6a/0xa0 net/core/dev.c:5384 [<ffffffff8367cd07>] __netif_receive_skb+0x27/0xa0 net/core/dev.c:5498 [<ffffffff8367cdd9>] netif_receive_skb_internal net/core/dev.c:5603 [inline] [<ffffffff8367cdd9>] netif_receive_skb+0x59/0x260 net/core/dev.c:5662 [<ffffffff83fe6302>] ieee802154_deliver_skb net/mac802154/rx.c:29 [inline] [<ffffffff83fe6302>] ieee802154_subif_frame net/mac802154/rx.c:102 [inline] [<ffffffff83fe6302>] __ieee802154_rx_handle_packet net/mac802154/rx.c:212 [inline] [<ffffffff83fe6302>] ieee802154_rx+0x612/0x620 net/mac802154/rx.c:284 [<ffffffff83fe59a6>] ieee802154_tasklet_handler+0x86/0xa0 net/mac802154/main.c:35 [<ffffffff81232aab>] tasklet_action_common.constprop.0+0x5b/0x100 kernel/softirq.c:557 [<ffffffff846000bf>] __do_softirq+0xbf/0x2ab kernel/softirq.c:345 [<ffffffff81232f4c>] do_softirq kernel/softirq.c:248 [inline] [<ffffffff81232f4c>] do_softirq+0x5c/0x80 kernel/softirq.c:235 [<ffffffff81232fc1>] __local_bh_enable_ip+0x51/0x60 kernel/softirq.c:198 [<ffffffff8367a9a4>] local_bh_enable include/linux/bottom_half.h:32 [inline] [<ffffffff8367a9a4>] rcu_read_unlock_bh include/linux/rcupdate.h:745 [inline] [<ffffffff8367a9a4>] __dev_queue_xmit+0x7f4/0xf60 net/core/dev.c:4221 [<ffffffff83fe2db4>] raw_sendmsg+0x1f4/0x2b0 net/ieee802154/socket.c:295 [<ffffffff8363af16>] sock_sendmsg_nosec net/socket.c:654 [inline] [<ffffffff8363af16>] sock_sendmsg+0x56/0x80 net/socket.c:674 [<ffffffff8363deec>] __sys_sendto+0x15c/0x200 net/socket.c:1977 [<ffffffff8363dfb6>] __do_sys_sendto net/socket.c:1989 [inline] [<ffffffff8363dfb6>] __se_sys_sendto net/socket.c:1985 [inline] [<ffffffff8363dfb6>] __x64_sys_sendto+0x26/0x30 net/socket.c:1985
Fixes: 9ec767160357 ("net: add IEEE 802.15.4 socket family implementation") Reported-and-tested-by: syzbot+1f68113fa907bf0695a8@syzkaller.appspotmail.com Signed-off-by: Takeshi Misawa jeliantsurux@gmail.com Acked-by: Alexander Aring aahringo@redhat.com Link: https://lore.kernel.org/r/20210805075414.GA15796@DESKTOP Signed-off-by: Stefan Schmidt stefan@datenfreihafen.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/ieee802154/socket.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c index d93d4531aa9b..9a675ba0bf0a 100644 --- a/net/ieee802154/socket.c +++ b/net/ieee802154/socket.c @@ -992,6 +992,11 @@ static const struct proto_ops ieee802154_dgram_ops = { #endif };
+static void ieee802154_sock_destruct(struct sock *sk) +{ + skb_queue_purge(&sk->sk_receive_queue); +} + /* Create a socket. Initialise the socket, blank the addresses * set the state. */ @@ -1032,7 +1037,7 @@ static int ieee802154_create(struct net *net, struct socket *sock, sock->ops = ops;
sock_init_data(sock, sk); - /* FIXME: sk->sk_destruct */ + sk->sk_destruct = ieee802154_sock_destruct; sk->sk_family = PF_IEEE802154;
/* Checksums on by default */
From: Eric Dumazet edumazet@google.com
[ Upstream commit 4a2b285e7e103d4d6c6ed3e5052a0ff74a5d7f15 ]
Fix the data-race reported by syzbot [1] Issue here is that igmp_ifc_timer_expire() can update in_dev->mr_ifc_count while another change just occured from another context.
in_dev->mr_ifc_count is only 8bit wide, so the race had little consequences.
[1] BUG: KCSAN: data-race in igmp_ifc_event / igmp_ifc_timer_expire
write to 0xffff8881051e3062 of 1 bytes by task 12547 on cpu 0: igmp_ifc_event+0x1d5/0x290 net/ipv4/igmp.c:821 igmp_group_added+0x462/0x490 net/ipv4/igmp.c:1356 ____ip_mc_inc_group+0x3ff/0x500 net/ipv4/igmp.c:1461 __ip_mc_join_group+0x24d/0x2c0 net/ipv4/igmp.c:2199 ip_mc_join_group_ssm+0x20/0x30 net/ipv4/igmp.c:2218 do_ip_setsockopt net/ipv4/ip_sockglue.c:1285 [inline] ip_setsockopt+0x1827/0x2a80 net/ipv4/ip_sockglue.c:1423 tcp_setsockopt+0x8c/0xa0 net/ipv4/tcp.c:3657 sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3362 __sys_setsockopt+0x18f/0x200 net/socket.c:2159 __do_sys_setsockopt net/socket.c:2170 [inline] __se_sys_setsockopt net/socket.c:2167 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2167 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff8881051e3062 of 1 bytes by interrupt on cpu 1: igmp_ifc_timer_expire+0x706/0xa30 net/ipv4/igmp.c:808 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1419 expire_timers+0x135/0x250 kernel/time/timer.c:1464 __run_timers+0x358/0x420 kernel/time/timer.c:1732 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1745 __do_softirq+0x12c/0x26e kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x9a/0xb0 kernel/softirq.c:636 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1100 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 console_unlock+0x8e8/0xb30 kernel/printk/printk.c:2646 vprintk_emit+0x125/0x3d0 kernel/printk/printk.c:2174 vprintk_default+0x22/0x30 kernel/printk/printk.c:2185 vprintk+0x15a/0x170 kernel/printk/printk_safe.c:392 printk+0x62/0x87 kernel/printk/printk.c:2216 selinux_netlink_send+0x399/0x400 security/selinux/hooks.c:6041 security_netlink_send+0x42/0x90 security/security.c:2070 netlink_sendmsg+0x59e/0x7c0 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:703 [inline] sock_sendmsg net/socket.c:723 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392 ___sys_sendmsg net/socket.c:2446 [inline] __sys_sendmsg+0x1ed/0x270 net/socket.c:2475 __do_sys_sendmsg net/socket.c:2484 [inline] __se_sys_sendmsg net/socket.c:2482 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2482 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x01 -> 0x02
Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 12539 Comm: syz-executor.1 Not tainted 5.14.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/ipv4/igmp.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index c8cbdc4d5cbc..cfa31c34b5bb 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -805,10 +805,17 @@ static void igmp_gq_timer_expire(struct timer_list *t) static void igmp_ifc_timer_expire(struct timer_list *t) { struct in_device *in_dev = from_timer(in_dev, t, mr_ifc_timer); + u8 mr_ifc_count;
igmpv3_send_cr(in_dev); - if (in_dev->mr_ifc_count) { - in_dev->mr_ifc_count--; +restart: + mr_ifc_count = READ_ONCE(in_dev->mr_ifc_count); + + if (mr_ifc_count) { + if (cmpxchg(&in_dev->mr_ifc_count, + mr_ifc_count, + mr_ifc_count - 1) != mr_ifc_count) + goto restart; igmp_ifc_start_timer(in_dev, unsolicited_report_interval(in_dev)); } @@ -820,7 +827,7 @@ static void igmp_ifc_event(struct in_device *in_dev) struct net *net = dev_net(in_dev->dev); if (IGMP_V1_SEEN(in_dev) || IGMP_V2_SEEN(in_dev)) return; - in_dev->mr_ifc_count = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv; + WRITE_ONCE(in_dev->mr_ifc_count, in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv); igmp_ifc_start_timer(in_dev, 1); }
@@ -959,7 +966,7 @@ static bool igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb, in_dev->mr_qri; } /* cancel the interface change timer */ - in_dev->mr_ifc_count = 0; + WRITE_ONCE(in_dev->mr_ifc_count, 0); if (del_timer(&in_dev->mr_ifc_timer)) __in_dev_put(in_dev); /* clear deleted report items */ @@ -1726,7 +1733,7 @@ void ip_mc_down(struct in_device *in_dev) igmp_group_dropped(pmc);
#ifdef CONFIG_IP_MULTICAST - in_dev->mr_ifc_count = 0; + WRITE_ONCE(in_dev->mr_ifc_count, 0); if (del_timer(&in_dev->mr_ifc_timer)) __in_dev_put(in_dev); in_dev->mr_gq_running = 0; @@ -1943,7 +1950,7 @@ static int ip_mc_del_src(struct in_device *in_dev, __be32 *pmca, int sfmode, pmc->sfmode = MCAST_INCLUDE; #ifdef CONFIG_IP_MULTICAST pmc->crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv; - in_dev->mr_ifc_count = pmc->crcount; + WRITE_ONCE(in_dev->mr_ifc_count, pmc->crcount); for (psf = pmc->sources; psf; psf = psf->sf_next) psf->sf_crcount = 0; igmp_ifc_event(pmc->interface); @@ -2122,7 +2129,7 @@ static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode, /* else no filters; keep old mode for reports */
pmc->crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv; - in_dev->mr_ifc_count = pmc->crcount; + WRITE_ONCE(in_dev->mr_ifc_count, pmc->crcount); for (psf = pmc->sources; psf; psf = psf->sf_next) psf->sf_crcount = 0; igmp_ifc_event(in_dev);
From: Vladimir Oltean vladimir.oltean@nxp.com
[ Upstream commit ada2fee185d8145afb89056558bb59545b9dbdd0 ]
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into multiple netlink skbs if the buffer provided by user space is too small (one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that point, and then the dump resumes on the same port with a new skb, and FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to drivers, then drivers must check for the -EMSGSIZE error code returned by it. Otherwise, when a netlink skb becomes full, DSA will no longer save newly dumped FDB entries to it, but the driver will continue dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: ab335349b852 ("net: dsa: lan9303: Add port_fast_age and port_fdb_dump methods") Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/lan9303-core.c | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c index bbec86b9418e..19d1f1c51f97 100644 --- a/drivers/net/dsa/lan9303-core.c +++ b/drivers/net/dsa/lan9303-core.c @@ -557,12 +557,12 @@ static int lan9303_alr_make_entry_raw(struct lan9303 *chip, u32 dat0, u32 dat1) return 0; }
-typedef void alr_loop_cb_t(struct lan9303 *chip, u32 dat0, u32 dat1, - int portmap, void *ctx); +typedef int alr_loop_cb_t(struct lan9303 *chip, u32 dat0, u32 dat1, + int portmap, void *ctx);
-static void lan9303_alr_loop(struct lan9303 *chip, alr_loop_cb_t *cb, void *ctx) +static int lan9303_alr_loop(struct lan9303 *chip, alr_loop_cb_t *cb, void *ctx) { - int i; + int ret = 0, i;
mutex_lock(&chip->alr_mutex); lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, @@ -582,13 +582,17 @@ static void lan9303_alr_loop(struct lan9303 *chip, alr_loop_cb_t *cb, void *ctx) LAN9303_ALR_DAT1_PORT_BITOFFS; portmap = alrport_2_portmap[alrport];
- cb(chip, dat0, dat1, portmap, ctx); + ret = cb(chip, dat0, dat1, portmap, ctx); + if (ret) + break;
lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, LAN9303_ALR_CMD_GET_NEXT); lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, 0); } mutex_unlock(&chip->alr_mutex); + + return ret; }
static void alr_reg_to_mac(u32 dat0, u32 dat1, u8 mac[6]) @@ -606,18 +610,20 @@ struct del_port_learned_ctx { };
/* Clear learned (non-static) entry on given port */ -static void alr_loop_cb_del_port_learned(struct lan9303 *chip, u32 dat0, - u32 dat1, int portmap, void *ctx) +static int alr_loop_cb_del_port_learned(struct lan9303 *chip, u32 dat0, + u32 dat1, int portmap, void *ctx) { struct del_port_learned_ctx *del_ctx = ctx; int port = del_ctx->port;
if (((BIT(port) & portmap) == 0) || (dat1 & LAN9303_ALR_DAT1_STATIC)) - return; + return 0;
/* learned entries has only one port, we can just delete */ dat1 &= ~LAN9303_ALR_DAT1_VALID; /* delete entry */ lan9303_alr_make_entry_raw(chip, dat0, dat1); + + return 0; }
struct port_fdb_dump_ctx { @@ -626,19 +632,19 @@ struct port_fdb_dump_ctx { dsa_fdb_dump_cb_t *cb; };
-static void alr_loop_cb_fdb_port_dump(struct lan9303 *chip, u32 dat0, - u32 dat1, int portmap, void *ctx) +static int alr_loop_cb_fdb_port_dump(struct lan9303 *chip, u32 dat0, + u32 dat1, int portmap, void *ctx) { struct port_fdb_dump_ctx *dump_ctx = ctx; u8 mac[ETH_ALEN]; bool is_static;
if ((BIT(dump_ctx->port) & portmap) == 0) - return; + return 0;
alr_reg_to_mac(dat0, dat1, mac); is_static = !!(dat1 & LAN9303_ALR_DAT1_STATIC); - dump_ctx->cb(mac, 0, is_static, dump_ctx->data); + return dump_ctx->cb(mac, 0, is_static, dump_ctx->data); }
/* Set a static ALR entry. Delete entry if port_map is zero */ @@ -1210,9 +1216,7 @@ static int lan9303_port_fdb_dump(struct dsa_switch *ds, int port, };
dev_dbg(chip->dev, "%s(%d)\n", __func__, port); - lan9303_alr_loop(chip, alr_loop_cb_fdb_port_dump, &dump_ctx); - - return 0; + return lan9303_alr_loop(chip, alr_loop_cb_fdb_port_dump, &dump_ctx); }
static int lan9303_port_mdb_prepare(struct dsa_switch *ds, int port,
From: Vladimir Oltean vladimir.oltean@nxp.com
[ Upstream commit 871a73a1c8f55da0a3db234e9dd816ea4fd546f2 ]
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into multiple netlink skbs if the buffer provided by user space is too small (one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that point, and then the dump resumes on the same port with a new skb, and FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to drivers, then drivers must check for the -EMSGSIZE error code returned by it. Otherwise, when a netlink skb becomes full, DSA will no longer save newly dumped FDB entries to it, but the driver will continue dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: 58c59ef9e930 ("net: dsa: lantiq: Add Forwarding Database access") Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/lantiq_gswip.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c index dc75e798dbff..af3d56636a07 100644 --- a/drivers/net/dsa/lantiq_gswip.c +++ b/drivers/net/dsa/lantiq_gswip.c @@ -1399,11 +1399,17 @@ static int gswip_port_fdb_dump(struct dsa_switch *ds, int port, addr[1] = mac_bridge.key[2] & 0xff; addr[0] = (mac_bridge.key[2] >> 8) & 0xff; if (mac_bridge.val[1] & GSWIP_TABLE_MAC_BRIDGE_STATIC) { - if (mac_bridge.val[0] & BIT(port)) - cb(addr, 0, true, data); + if (mac_bridge.val[0] & BIT(port)) { + err = cb(addr, 0, true, data); + if (err) + return err; + } } else { - if (((mac_bridge.val[0] & GENMASK(7, 4)) >> 4) == port) - cb(addr, 0, false, data); + if (((mac_bridge.val[0] & GENMASK(7, 4)) >> 4) == port) { + err = cb(addr, 0, false, data); + if (err) + return err; + } } } return 0;
From: Vladimir Oltean vladimir.oltean@nxp.com
[ Upstream commit 21b52fed928e96d2f75d2f6aa9eac7a4b0b55d22 ]
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into multiple netlink skbs if the buffer provided by user space is too small (one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that point, and then the dump resumes on the same port with a new skb, and FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to drivers, then drivers must check for the -EMSGSIZE error code returned by it. Otherwise, when a netlink skb becomes full, DSA will no longer save newly dumped FDB entries to it, but the driver will continue dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: 291d1e72b756 ("net: dsa: sja1105: Add support for FDB and MDB management") Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/sja1105/sja1105_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c index a07d8051ec3e..eab861352bf2 100644 --- a/drivers/net/dsa/sja1105/sja1105_main.c +++ b/drivers/net/dsa/sja1105/sja1105_main.c @@ -1312,7 +1312,9 @@ static int sja1105_fdb_dump(struct dsa_switch *ds, int port, /* We need to hide the dsa_8021q VLANs from the user. */ if (!dsa_port_is_vlan_filtering(&ds->ports[port])) l2_lookup.vlanid = 0; - cb(macaddr, l2_lookup.vlanid, l2_lookup.lockeds, data); + rc = cb(macaddr, l2_lookup.vlanid, l2_lookup.lockeds, data); + if (rc) + return rc; } return 0; }
From: Yang Yingliang yangyingliang@huawei.com
[ Upstream commit 519133debcc19f5c834e7e28480b60bdc234fe02 ]
I got a memleak report:
BUG: memory leak unreferenced object 0x607ee521a658 (size 240): comm "syz-executor.0", pid 955, jiffies 4294780569 (age 16.449s) hex dump (first 32 bytes, cpu 1): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000d830ea5a>] br_multicast_add_port+0x1c2/0x300 net/bridge/br_multicast.c:1693 [<00000000274d9a71>] new_nbp net/bridge/br_if.c:435 [inline] [<00000000274d9a71>] br_add_if+0x670/0x1740 net/bridge/br_if.c:611 [<0000000012ce888e>] do_set_master net/core/rtnetlink.c:2513 [inline] [<0000000012ce888e>] do_set_master+0x1aa/0x210 net/core/rtnetlink.c:2487 [<0000000099d1cafc>] __rtnl_newlink+0x1095/0x13e0 net/core/rtnetlink.c:3457 [<00000000a01facc0>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488 [<00000000acc9186c>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5550 [<00000000d4aabb9c>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504 [<00000000bc2e12a3>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] [<00000000bc2e12a3>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340 [<00000000e4dc2d0e>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929 [<000000000d22c8b3>] sock_sendmsg_nosec net/socket.c:654 [inline] [<000000000d22c8b3>] sock_sendmsg+0x139/0x170 net/socket.c:674 [<00000000e281417a>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350 [<00000000237aa2ab>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404 [<000000004f2dc381>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433 [<0000000005feca6c>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47 [<000000007304477d>] entry_SYSCALL_64_after_hwframe+0x44/0xae
On error path of br_add_if(), p->mcast_stats allocated in new_nbp() need be freed, or it will be leaked.
Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Acked-by: Nikolay Aleksandrov nikolay@nvidia.com Link: https://lore.kernel.org/r/20210809132023.978546-1-yangyingliang@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/bridge/br_if.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index bec20dbf6f60..e2a999890d05 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -599,6 +599,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
err = dev_set_allmulti(dev, 1); if (err) { + br_multicast_del_port(p); kfree(p); /* kobject not yet init'd, manually free */ goto err1; } @@ -712,6 +713,7 @@ err4: err3: sysfs_remove_link(br->ifobj, p->dev->name); err2: + br_multicast_del_port(p); kobject_put(&p->kobj); dev_set_allmulti(dev, -1); err1:
From: Willy Tarreau w@1wt.eu
[ Upstream commit 6922110d152e56d7569616b45a1f02876cf3eb9f ]
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed that my Ethernet port to which a bond and a VLAN interface are attached appeared to remain up after resuming from suspend with the cable unplugged (and that problem still persists with 5.10-LTS).
It happens that the following happens:
- the network driver (e1000e here) prepares to suspend, calls e1000e_down() which calls netif_carrier_off() to signal that the link is going down. - netif_carrier_off() adds a link_watch event to the list of events for this device - the device is completely stopped. - the machine suspends - the cable is unplugged and the machine brought to another location - the machine is resumed - the queued linkwatch events are processed for the device - the device doesn't yet have the __LINK_STATE_PRESENT bit and its events are silently dropped - the device is resumed with its link down - the upper VLAN and bond interfaces are never notified that the link had been turned down and remain up - the only way to provoke a change is to physically connect the machine to a port and possibly unplug it.
The state after resume looks like this: $ ip -br li | egrep 'bond|eth' bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
Placing an explicit call to netdev_state_change() either in the suspend or the resume code in the NIC driver worked around this but the solution is not satisfying.
The issue in fact really is in link_watch that loses events while it ought not to. It happens that the test for the device being present was added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") in 4.20 to avoid an access to devices that are not present.
Instead of dropping events, this patch proceeds slightly differently by postponing their handling so that they happen after the device is fully resumed.
Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") Link: https://lists.openwall.net/netdev/2018/03/15/62 Cc: Heiner Kallweit hkallweit1@gmail.com Cc: Geert Uytterhoeven geert+renesas@glider.be Cc: Florian Fainelli f.fainelli@gmail.com Signed-off-by: Willy Tarreau w@1wt.eu Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/link_watch.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/core/link_watch.c b/net/core/link_watch.c index f153e0601838..35b0e39030da 100644 --- a/net/core/link_watch.c +++ b/net/core/link_watch.c @@ -150,7 +150,7 @@ static void linkwatch_do_dev(struct net_device *dev) clear_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state);
rfc2863_policy(dev); - if (dev->flags & IFF_UP && netif_device_present(dev)) { + if (dev->flags & IFF_UP) { if (netif_carrier_ok(dev)) dev_activate(dev); else @@ -196,7 +196,8 @@ static void __linkwatch_run_queue(int urgent_only) dev = list_first_entry(&wrk, struct net_device, link_watch_list); list_del_init(&dev->link_watch_list);
- if (urgent_only && !linkwatch_urgent_event(dev)) { + if (!netif_device_present(dev) || + (urgent_only && !linkwatch_urgent_event(dev))) { list_add_tail(&dev->link_watch_list, &lweventlist); continue; }
From: Neal Cardwell ncardwell@google.com
[ Upstream commit 6de035fec045f8ae5ee5f3a02373a18b939e91fb ]
Currently if BBR congestion control is initialized after more than 2B packets have been delivered, depending on the phase of the tp->delivered counter the tracking of BBR round trips can get stuck.
The bug arises because if tp->delivered is between 2^31 and 2^32 at the time the BBR congestion control module is initialized, then the initialization of bbr->next_rtt_delivered to 0 will cause the logic to believe that the end of the round trip is still billions of packets in the future. More specifically, the following check will fail repeatedly:
!before(rs->prior_delivered, bbr->next_rtt_delivered)
and thus the connection will take up to 2B packets delivered before that check will pass and the connection will set:
bbr->round_start = 1;
This could cause many mechanisms in BBR to fail to trigger, for example bbr_check_full_bw_reached() would likely never exit STARTUP.
This bug is 5 years old and has not been observed, and as a practical matter this would likely rarely trigger, since it would require transferring at least 2B packets, or likely more than 3 terabytes of data, before switching congestion control algorithms to BBR.
This patch is a stable candidate for kernels as far back as v4.9, when tcp_bbr.c was added.
Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell ncardwell@google.com Reviewed-by: Yuchung Cheng ycheng@google.com Reviewed-by: Kevin Yang yyd@google.com Reviewed-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/ipv4/tcp_bbr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 6ea3dc2e4219..6274462b86b4 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -1041,7 +1041,7 @@ static void bbr_init(struct sock *sk) bbr->prior_cwnd = 0; tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; bbr->rtt_cnt = 0; - bbr->next_rtt_delivered = 0; + bbr->next_rtt_delivered = tp->delivered; bbr->prev_ca_state = TCP_CA_Open; bbr->packet_conservation = 0;
From: Eric Dumazet edumazet@google.com
[ Upstream commit b69dd5b3780a7298bd893816a09da751bc0636f7 ]
Some arches support cmpxchg() on 4-byte and 8-byte only. Increase mr_ifc_count width to 32bit to fix this problem.
Fixes: 4a2b285e7e10 ("net: igmp: fix data-race in igmp_ifc_timer_expire()") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: Guenter Roeck linux@roeck-us.net Link: https://lore.kernel.org/r/20210811195715.3684218-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/inetdevice.h | 2 +- net/ipv4/igmp.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h index 3515ca64e638..b68fca08be27 100644 --- a/include/linux/inetdevice.h +++ b/include/linux/inetdevice.h @@ -41,7 +41,7 @@ struct in_device { unsigned long mr_qri; /* Query Response Interval */ unsigned char mr_qrv; /* Query Robustness Variable */ unsigned char mr_gq_running; - unsigned char mr_ifc_count; + u32 mr_ifc_count; struct timer_list mr_gq_timer; /* general query timer */ struct timer_list mr_ifc_timer; /* interface change timer */
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index cfa31c34b5bb..d2b1ae83f258 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -805,7 +805,7 @@ static void igmp_gq_timer_expire(struct timer_list *t) static void igmp_ifc_timer_expire(struct timer_list *t) { struct in_device *in_dev = from_timer(in_dev, t, mr_ifc_timer); - u8 mr_ifc_count; + u32 mr_ifc_count;
igmpv3_send_cr(in_dev); restart:
From: Maximilian Heyne mheyne@amazon.de
[ Upstream commit 88ca2521bd5b4e8b83743c01a2d4cb09325b51e9 ]
There is a TOCTOU issue in set_evtchn_to_irq. Rows in the evtchn_to_irq mapping are lazily allocated in this function. The check whether the row is already present and the row initialization is not synchronized. Two threads can at the same time allocate a new row for evtchn_to_irq and add the irq mapping to the their newly allocated row. One thread will overwrite what the other has set for evtchn_to_irq[row] and therefore the irq mapping is lost. This will trigger a BUG_ON later in bind_evtchn_to_cpu:
INFO: pci 0000:1a:15.4: [1d0f:8061] type 00 class 0x010802 INFO: nvme 0000:1a:12.1: enabling device (0000 -> 0002) INFO: nvme nvme77: 1/0/0 default/read/poll queues CRIT: kernel BUG at drivers/xen/events/events_base.c:427! WARN: invalid opcode: 0000 [#1] SMP NOPTI WARN: Workqueue: nvme-reset-wq nvme_reset_work [nvme] WARN: RIP: e030:bind_evtchn_to_cpu+0xc2/0xd0 WARN: Call Trace: WARN: set_affinity_irq+0x121/0x150 WARN: irq_do_set_affinity+0x37/0xe0 WARN: irq_setup_affinity+0xf6/0x170 WARN: irq_startup+0x64/0xe0 WARN: __setup_irq+0x69e/0x740 WARN: ? request_threaded_irq+0xad/0x160 WARN: request_threaded_irq+0xf5/0x160 WARN: ? nvme_timeout+0x2f0/0x2f0 [nvme] WARN: pci_request_irq+0xa9/0xf0 WARN: ? pci_alloc_irq_vectors_affinity+0xbb/0x130 WARN: queue_request_irq+0x4c/0x70 [nvme] WARN: nvme_reset_work+0x82d/0x1550 [nvme] WARN: ? check_preempt_wakeup+0x14f/0x230 WARN: ? check_preempt_curr+0x29/0x80 WARN: ? nvme_irq_check+0x30/0x30 [nvme] WARN: process_one_work+0x18e/0x3c0 WARN: worker_thread+0x30/0x3a0 WARN: ? process_one_work+0x3c0/0x3c0 WARN: kthread+0x113/0x130 WARN: ? kthread_park+0x90/0x90 WARN: ret_from_fork+0x3a/0x50
This patch sets evtchn_to_irq rows via a cmpxchg operation so that they will be set only once. The row is now cleared before writing it to evtchn_to_irq in order to not create a race once the row is visible for other threads.
While at it, do not require the page to be zeroed, because it will be overwritten with -1's in clear_evtchn_to_irq_row anyway.
Signed-off-by: Maximilian Heyne mheyne@amazon.de Fixes: d0b075ffeede ("xen/events: Refactor evtchn_to_irq array to be dynamically allocated") Link: https://lore.kernel.org/r/20210812130930.127134-1-mheyne@amazon.de Reviewed-by: Boris Ostrovsky boris.ostrovsky@oracle.com Signed-off-by: Boris Ostrovsky boris.ostrovsky@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/xen/events/events_base.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index de825df4abf6..87cfadd70d0d 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -134,12 +134,12 @@ static void disable_dynirq(struct irq_data *data);
static DEFINE_PER_CPU(unsigned int, irq_epoch);
-static void clear_evtchn_to_irq_row(unsigned row) +static void clear_evtchn_to_irq_row(int *evtchn_row) { unsigned col;
for (col = 0; col < EVTCHN_PER_ROW; col++) - WRITE_ONCE(evtchn_to_irq[row][col], -1); + WRITE_ONCE(evtchn_row[col], -1); }
static void clear_evtchn_to_irq_all(void) @@ -149,7 +149,7 @@ static void clear_evtchn_to_irq_all(void) for (row = 0; row < EVTCHN_ROW(xen_evtchn_max_channels()); row++) { if (evtchn_to_irq[row] == NULL) continue; - clear_evtchn_to_irq_row(row); + clear_evtchn_to_irq_row(evtchn_to_irq[row]); } }
@@ -157,6 +157,7 @@ static int set_evtchn_to_irq(unsigned evtchn, unsigned irq) { unsigned row; unsigned col; + int *evtchn_row;
if (evtchn >= xen_evtchn_max_channels()) return -EINVAL; @@ -169,11 +170,18 @@ static int set_evtchn_to_irq(unsigned evtchn, unsigned irq) if (irq == -1) return 0;
- evtchn_to_irq[row] = (int *)get_zeroed_page(GFP_KERNEL); - if (evtchn_to_irq[row] == NULL) + evtchn_row = (int *) __get_free_pages(GFP_KERNEL, 0); + if (evtchn_row == NULL) return -ENOMEM;
- clear_evtchn_to_irq_row(row); + clear_evtchn_to_irq_row(evtchn_row); + + /* + * We've prepared an empty row for the mapping. If a different + * thread was faster inserting it, we can drop ours. + */ + if (cmpxchg(&evtchn_to_irq[row], NULL, evtchn_row) != NULL) + free_page((unsigned long) evtchn_row); }
WRITE_ONCE(evtchn_to_irq[row][col], irq);
From: Longpeng(Mike) longpeng2@huawei.com
[ Upstream commit 49b0b6ffe20c5344f4173f3436298782a08da4f2 ]
There's a potential deadlock case when remove the vsock device or process the RESET event:
vsock_for_each_connected_socket: spin_lock_bh(&vsock_table_lock) ----------- (1) ... virtio_vsock_reset_sock: lock_sock(sk) --------------------- (2) ... spin_unlock_bh(&vsock_table_lock)
lock_sock() may do initiative schedule when the 'sk' is owned by other thread at the same time, we would receivce a warning message that "scheduling while atomic".
Even worse, if the next task (selected by the scheduler) try to release a 'sk', it need to request vsock_table_lock and the deadlock occur, cause the system into softlockup state. Call trace: queued_spin_lock_slowpath vsock_remove_bound vsock_remove_sock virtio_transport_release __vsock_release vsock_release __sock_release sock_close __fput ____fput
So we should not require sk_lock in this case, just like the behavior in vhost_vsock or vmci.
Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko") Cc: Stefan Hajnoczi stefanha@redhat.com Signed-off-by: Longpeng(Mike) longpeng2@huawei.com Reviewed-by: Stefano Garzarella sgarzare@redhat.com Link: https://lore.kernel.org/r/20210812053056.1699-1-longpeng2@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/vmw_vsock/virtio_transport.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c index 5905f0cddc89..7973f98ebd91 100644 --- a/net/vmw_vsock/virtio_transport.c +++ b/net/vmw_vsock/virtio_transport.c @@ -373,11 +373,14 @@ static void virtio_vsock_event_fill(struct virtio_vsock *vsock)
static void virtio_vsock_reset_sock(struct sock *sk) { - lock_sock(sk); + /* vmci_transport.c doesn't take sk_lock here either. At least we're + * under vsock_table_lock so the sock cannot disappear while we're + * executing. + */ + sk->sk_state = TCP_CLOSE; sk->sk_err = ECONNRESET; sk->sk_error_report(sk); - release_sock(sk); }
static void virtio_vsock_update_guest_cid(struct virtio_vsock *vsock)
From: Xie Yongji xieyongji@bytedance.com
[ Upstream commit cddce01160582a5f52ada3da9626c052d852ec42 ]
There is a race between iterating over requests in nbd_clear_que() and completing requests in recv_work(), which can lead to double completion of a request.
To fix it, flush the recv worker before iterating over the requests and don't abort the completed request while iterating.
Fixes: 96d97e17828f ("nbd: clear_sock on netlink disconnect") Reported-by: Jiang Yadong jiangyadong@bytedance.com Signed-off-by: Xie Yongji xieyongji@bytedance.com Reviewed-by: Josef Bacik josef@toxicpanda.com Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/block/nbd.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 839364371f9a..25e81b1a59a5 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -797,6 +797,10 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved) { struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
+ /* don't abort one completed request */ + if (blk_mq_request_completed(req)) + return true; + mutex_lock(&cmd->lock); cmd->status = BLK_STS_IOERR; mutex_unlock(&cmd->lock); @@ -2009,15 +2013,19 @@ static void nbd_disconnect_and_put(struct nbd_device *nbd) { mutex_lock(&nbd->config_lock); nbd_disconnect(nbd); - nbd_clear_sock(nbd); - mutex_unlock(&nbd->config_lock); + sock_shutdown(nbd); /* * Make sure recv thread has finished, so it does not drop the last * config ref and try to destroy the workqueue from inside the work - * queue. + * queue. And this also ensure that we can safely call nbd_clear_que() + * to cancel the inflight I/Os. */ if (nbd->recv_workq) flush_workqueue(nbd->recv_workq); + nbd_clear_que(nbd); + nbd->task_setup = NULL; + mutex_unlock(&nbd->config_lock); + if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF, &nbd->config->runtime_flags)) nbd_config_put(nbd);
From: Pu Lehui pulehui@huawei.com
[ Upstream commit 43e8f76006592cb1573a959aa287c45421066f9c ]
When using kprobe on powerpc booke series processor, Oops happens as show bellow:
/ # echo "p:myprobe do_nanosleep" > /sys/kernel/debug/tracing/kprobe_events / # echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable / # sleep 1 [ 50.076730] Oops: Exception in kernel mode, sig: 5 [#1] [ 50.077017] BE PAGE_SIZE=4K SMP NR_CPUS=24 QEMU e500 [ 50.077221] Modules linked in: [ 50.077462] CPU: 0 PID: 77 Comm: sleep Not tainted 5.14.0-rc4-00022-g251a1524293d #21 [ 50.077887] NIP: c0b9c4e0 LR: c00ebecc CTR: 00000000 [ 50.078067] REGS: c3883de0 TRAP: 0700 Not tainted (5.14.0-rc4-00022-g251a1524293d) [ 50.078349] MSR: 00029000 <CE,EE,ME> CR: 24000228 XER: 20000000 [ 50.078675] [ 50.078675] GPR00: c00ebdf0 c3883e90 c313e300 c3883ea0 00000001 00000000 c3883ecc 00000001 [ 50.078675] GPR08: c100598c c00ea250 00000004 00000000 24000222 102490c2 bff4180c 101e60d4 [ 50.078675] GPR16: 00000000 102454ac 00000040 10240000 10241100 102410f8 10240000 00500000 [ 50.078675] GPR24: 00000002 00000000 c3883ea0 00000001 00000000 0000c350 3b9b8d50 00000000 [ 50.080151] NIP [c0b9c4e0] do_nanosleep+0x0/0x190 [ 50.080352] LR [c00ebecc] hrtimer_nanosleep+0x14c/0x1e0 [ 50.080638] Call Trace: [ 50.080801] [c3883e90] [c00ebdf0] hrtimer_nanosleep+0x70/0x1e0 (unreliable) [ 50.081110] [c3883f00] [c00ec004] sys_nanosleep_time32+0xa4/0x110 [ 50.081336] [c3883f40] [c001509c] ret_from_syscall+0x0/0x28 [ 50.081541] --- interrupt: c00 at 0x100a4d08 [ 50.081749] NIP: 100a4d08 LR: 101b5234 CTR: 00000003 [ 50.081931] REGS: c3883f50 TRAP: 0c00 Not tainted (5.14.0-rc4-00022-g251a1524293d) [ 50.082183] MSR: 0002f902 <CE,EE,PR,FP,ME> CR: 24000222 XER: 00000000 [ 50.082457] [ 50.082457] GPR00: 000000a2 bf980040 1024b4d0 bf980084 bf980084 64000000 00555345 fefefeff [ 50.082457] GPR08: 7f7f7f7f 101e0000 00000069 00000003 28000422 102490c2 bff4180c 101e60d4 [ 50.082457] GPR16: 00000000 102454ac 00000040 10240000 10241100 102410f8 10240000 00500000 [ 50.082457] GPR24: 00000002 bf9803f4 10240000 00000000 00000000 100039e0 00000000 102444e8 [ 50.083789] NIP [100a4d08] 0x100a4d08 [ 50.083917] LR [101b5234] 0x101b5234 [ 50.084042] --- interrupt: c00 [ 50.084238] Instruction dump: [ 50.084483] 4bfffc40 60000000 60000000 60000000 9421fff0 39400402 914200c0 38210010 [ 50.084841] 4bfffc20 00000000 00000000 00000000 <7fe00008> 7c0802a6 7c892378 93c10048 [ 50.085487] ---[ end trace f6fffe98e2fa8f3e ]--- [ 50.085678] Trace/breakpoint trap
There is no real mode for booke arch and the MMU translation is always on. The corresponding MSR_IS/MSR_DS bit in booke is used to switch the address space, but not for real mode judgment.
Fixes: 21f8b2fa3ca5 ("powerpc/kprobes: Ignore traps that happened in real mode") Signed-off-by: Pu Lehui pulehui@huawei.com Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20210809023658.218915-1-pulehui@huawei.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/powerpc/kernel/kprobes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 9b340af02c38..dd01a4abc258 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -264,7 +264,8 @@ int kprobe_handler(struct pt_regs *regs) if (user_mode(regs)) return 0;
- if (!(regs->msr & MSR_IR) || !(regs->msr & MSR_DR)) + if (!IS_ENABLED(CONFIG_BOOKE) && + (!(regs->msr & MSR_IR) || !(regs->msr & MSR_DR))) return 0;
/*
From: Randy Dunlap rdunlap@infradead.org
[ Upstream commit 839ad22f755132838f406751439363c07272ad87 ]
Skip (omit) any version string info that is parenthesized.
Warning: objdump version 15) is older than 2.19 Warning: Skipping posttest.
where 'objdump -v' says: GNU objdump (GNU Binutils; SUSE Linux Enterprise 15) 2.35.1.20201123-7.18
Fixes: 8bee738bb1979 ("x86: Fix objdump version check in chkobjdump.awk for different formats.") Signed-off-by: Randy Dunlap rdunlap@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Masami Hiramatsu mhiramat@kernel.org Link: https://lore.kernel.org/r/20210731000146.2720-1-rdunlap@infradead.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/tools/chkobjdump.awk | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/x86/tools/chkobjdump.awk b/arch/x86/tools/chkobjdump.awk index fd1ab80be0de..a4cf678cf5c8 100644 --- a/arch/x86/tools/chkobjdump.awk +++ b/arch/x86/tools/chkobjdump.awk @@ -10,6 +10,7 @@ BEGIN {
/^GNU objdump/ { verstr = "" + gsub(/(.*)/, ""); for (i = 3; i <= NF; i++) if (match($(i), "^[0-9]")) { verstr = $(i);
From: Thomas Gleixner tglx@linutronix.de
commit 826da771291fc25a428e871f9e7fb465e390f852 upstream.
X86 IO/APIC and MSI interrupts (when used without interrupts remapping) require that the affinity setup on startup is done before the interrupt is enabled for the first time as the non-remapped operation mode cannot safely migrate enabled interrupts from arbitrary contexts. Provide a new irq chip flag which allows affected hardware to request this.
This has to be opt-in because there have been reports in the past that some interrupt chips cannot handle affinity setting before startup.
Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.779791738@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/irq.h | 2 ++ kernel/irq/chip.c | 5 ++++- 2 files changed, 6 insertions(+), 1 deletion(-)
--- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -542,6 +542,7 @@ struct irq_chip { * IRQCHIP_EOI_THREADED: Chip requires eoi() on unmask in threaded mode * IRQCHIP_SUPPORTS_LEVEL_MSI Chip can provide two doorbells for Level MSIs * IRQCHIP_SUPPORTS_NMI: Chip can deliver NMIs, only for root irqchips + * IRQCHIP_AFFINITY_PRE_STARTUP: Default affinity update before startup */ enum { IRQCHIP_SET_TYPE_MASKED = (1 << 0), @@ -553,6 +554,7 @@ enum { IRQCHIP_EOI_THREADED = (1 << 6), IRQCHIP_SUPPORTS_LEVEL_MSI = (1 << 7), IRQCHIP_SUPPORTS_NMI = (1 << 8), + IRQCHIP_AFFINITY_PRE_STARTUP = (1 << 10), };
#include <linux/irqdesc.h> --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -265,8 +265,11 @@ int irq_startup(struct irq_desc *desc, b } else { switch (__irq_startup_managed(desc, aff, force)) { case IRQ_STARTUP_NORMAL: + if (d->chip->flags & IRQCHIP_AFFINITY_PRE_STARTUP) + irq_setup_affinity(desc); ret = __irq_startup(desc); - irq_setup_affinity(desc); + if (!(d->chip->flags & IRQCHIP_AFFINITY_PRE_STARTUP)) + irq_setup_affinity(desc); break; case IRQ_STARTUP_MANAGED: irq_do_set_affinity(d, aff, false);
From: Thomas Gleixner tglx@linutronix.de
commit ff363f480e5997051dd1de949121ffda3b753741 upstream.
The X86 MSI mechanism cannot handle interrupt affinity changes safely after startup other than from an interrupt handler, unless interrupt remapping is enabled. The startup sequence in the generic interrupt code violates that assumption.
Mark the irq chips with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that the default interrupt setting happens before the interrupt is started up for the first time.
While the interrupt remapping MSI chip does not require this, there is no point in treating it differently as this might spare an interrupt to a CPU which is not in the default affinity mask.
For the non-remapping case go to the direct write path when the interrupt is not yet started similar to the not yet activated case.
Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.886722080@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/apic/msi.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
--- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -86,11 +86,13 @@ msi_set_affinity(struct irq_data *irqd, * The quirk bit is not set in this case. * - The new vector is the same as the old vector * - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up) + * - The interrupt is not yet started up * - The new destination CPU is the same as the old destination CPU */ if (!irqd_msi_nomask_quirk(irqd) || cfg->vector == old_cfg.vector || old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR || + !irqd_is_started(irqd) || cfg->dest_apicid == old_cfg.dest_apicid) { irq_msi_update_msg(irqd, cfg); return ret; @@ -178,7 +180,8 @@ static struct irq_chip pci_msi_controlle .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_compose_msi_msg = irq_msi_compose_msg, .irq_set_affinity = msi_set_affinity, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | + IRQCHIP_AFFINITY_PRE_STARTUP, };
int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) @@ -279,7 +282,8 @@ static struct irq_chip pci_msi_ir_contro .irq_ack = irq_chip_ack_parent, .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_set_vcpu_affinity = irq_chip_set_vcpu_affinity_parent, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | + IRQCHIP_AFFINITY_PRE_STARTUP, };
static struct msi_domain_info pci_msi_ir_domain_info = { @@ -322,7 +326,8 @@ static struct irq_chip dmar_msi_controll .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_compose_msi_msg = irq_msi_compose_msg, .irq_write_msi_msg = dmar_msi_write_msg, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | + IRQCHIP_AFFINITY_PRE_STARTUP, };
static irq_hw_number_t dmar_msi_get_hwirq(struct msi_domain_info *info, @@ -420,7 +425,7 @@ static struct irq_chip hpet_msi_controll .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_compose_msi_msg = irq_msi_compose_msg, .irq_write_msi_msg = hpet_msi_write_msg, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | IRQCHIP_AFFINITY_PRE_STARTUP, };
static irq_hw_number_t hpet_msi_get_hwirq(struct msi_domain_info *info,
From: Thomas Gleixner tglx@linutronix.de
commit 0c0e37dc11671384e53ba6ede53a4d91162a2cc5 upstream.
The IO/APIC cannot handle interrupt affinity changes safely after startup other than from an interrupt handler. The startup sequence in the generic interrupt code violates that assumption.
Mark the irq chip with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that the default interrupt setting happens before the interrupt is started up for the first time.
Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.832143400@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/apic/io_apic.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -1961,7 +1961,8 @@ static struct irq_chip ioapic_chip __rea .irq_set_affinity = ioapic_set_affinity, .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_get_irqchip_state = ioapic_irq_get_chip_state, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | + IRQCHIP_AFFINITY_PRE_STARTUP, };
static struct irq_chip ioapic_ir_chip __read_mostly = { @@ -1974,7 +1975,8 @@ static struct irq_chip ioapic_ir_chip __ .irq_set_affinity = ioapic_set_affinity, .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_get_irqchip_state = ioapic_irq_get_chip_state, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | + IRQCHIP_AFFINITY_PRE_STARTUP, };
static inline void init_IO_APIC_traps(void)
From: Babu Moger Babu.Moger@amd.com
commit 064855a69003c24bd6b473b367d364e418c57625 upstream.
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes on the entire filesystem.
Steps to reproduce:
1. mount -t resctrl resctrl /sys/fs/resctrl/
2. cd /sys/fs/resctrl/
3. cat mon_data/mon_L3_00/mbm_total_bytes 23189832
4. Create sub monitor group: mkdir mon_groups/test1
5. cat mon_data/mon_L3_00/mbm_total_bytes Unavailable
When a new monitoring group is created, a new RMID is assigned to the new group. But the RMID is not active yet. When the events are read on the new RMID, it is expected to report the status as "Unavailable".
When the user reads the events on the default monitoring group with multiple subgroups, the events on all subgroups are consolidated together. Currently, if any of the RMID reads report as "Unavailable", then everything will be reported as "Unavailable".
Fix the issue by discarding the "Unavailable" reads and reporting all the successful RMID reads. This is not a problem on Intel systems as Intel reports 0 on Inactive RMIDs.
Fixes: d89b7379015f ("x86/intel_rdt/cqm: Add mon_data") Reported-by: Paweł Szulik pawel.szulik@intel.com Signed-off-by: Babu Moger Babu.Moger@amd.com Signed-off-by: Borislav Petkov bp@suse.de Acked-by: Reinette Chatre reinette.chatre@intel.com Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311 Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoge... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/resctrl/monitor.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-)
--- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -223,15 +223,14 @@ static u64 mbm_overflow_count(u64 prev_m return chunks >>= shift; }
-static int __mon_event_count(u32 rmid, struct rmid_read *rr) +static u64 __mon_event_count(u32 rmid, struct rmid_read *rr) { struct mbm_state *m; u64 chunks, tval;
tval = __rmid_read(rmid, rr->evtid); if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) { - rr->val = tval; - return -EINVAL; + return tval; } switch (rr->evtid) { case QOS_L3_OCCUP_EVENT_ID: @@ -243,12 +242,6 @@ static int __mon_event_count(u32 rmid, s case QOS_L3_MBM_LOCAL_EVENT_ID: m = &rr->d->mbm_local[rmid]; break; - default: - /* - * Code would never reach here because - * an invalid event id would fail the __rmid_read. - */ - return -EINVAL; }
if (rr->first) { @@ -298,23 +291,29 @@ void mon_event_count(void *info) struct rdtgroup *rdtgrp, *entry; struct rmid_read *rr = info; struct list_head *head; + u64 ret_val;
rdtgrp = rr->rgrp;
- if (__mon_event_count(rdtgrp->mon.rmid, rr)) - return; + ret_val = __mon_event_count(rdtgrp->mon.rmid, rr);
/* - * For Ctrl groups read data from child monitor groups. + * For Ctrl groups read data from child monitor groups and + * add them together. Count events which are read successfully. + * Discard the rmid_read's reporting errors. */ head = &rdtgrp->mon.crdtgrp_list;
if (rdtgrp->type == RDTCTRL_GROUP) { list_for_each_entry(entry, head, mon.crdtgrp_list) { - if (__mon_event_count(entry->mon.rmid, rr)) - return; + if (__mon_event_count(entry->mon.rmid, rr) == 0) + ret_val = 0; } } + + /* Report error if none of rmid_reads are successful */ + if (ret_val) + rr->val = ret_val; }
/*
From: Bixuan Cui cuibixuan@huawei.com
commit dbbc93576e03fbe24b365fab0e901eb442237a8a upstream.
msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but msi_domain_free_irqs() does not enforce deactivation before tearing down the interrupts.
This happens when PCI/MSI interrupts are set up and never used before being torn down again, e.g. in error handling pathes. The only place which cleans that up is the error handling path in msi_domain_alloc_irqs().
Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs() to cure that.
Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early") Signed-off-by: Bixuan Cui cuibixuan@huawei.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210518033117.78104-1-cuibixuan@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/irq/msi.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
--- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -477,11 +477,6 @@ skip_activate: return 0;
cleanup: - for_each_msi_vector(desc, i, dev) { - irq_data = irq_domain_get_irq_data(domain, i); - if (irqd_is_activated(irq_data)) - irq_domain_deactivate_irq(irq_data); - } msi_domain_free_irqs(domain, dev); return ret; } @@ -494,7 +489,15 @@ cleanup: */ void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev) { + struct irq_data *irq_data; struct msi_desc *desc; + int i; + + for_each_msi_vector(desc, i, dev) { + irq_data = irq_domain_get_irq_data(domain, i); + if (irqd_is_activated(irq_data)) + irq_domain_deactivate_irq(irq_data); + }
for_each_msi_entry(desc, dev) { /*
From: Ben Dai ben.dai@unisoc.com
commit b9cc7d8a4656a6e815852c27ab50365009cb69c1 upstream.
When the interrupt interval is greater than 2 ^ PREDICTION_BUFFER_SIZE * PREDICTION_FACTOR us and less than 1s, the calculated index will be greater than the length of irqs->ema_time[]. Check the calculated index before using it to prevent array overflow.
Fixes: 23aa3b9a6b7d ("genirq/timings: Encapsulate storing function") Signed-off-by: Ben Dai ben.dai@unisoc.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210425150903.25456-1-ben.dai9703@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/irq/timings.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/kernel/irq/timings.c +++ b/kernel/irq/timings.c @@ -453,6 +453,11 @@ static __always_inline void __irq_timing */ index = irq_timings_interval_index(interval);
+ if (index > PREDICTION_BUFFER_SIZE - 1) { + irqs->count = 0; + return; + } + /* * Store the index as an element of the pattern in another * circular array.
From: Thomas Gleixner tglx@linutronix.de
commit 438553958ba19296663c6d6583d208dfb6792830 upstream.
The ordering of MSI-X enable in hardware is dysfunctional:
1) MSI-X is disabled in the control register 2) Various setup functions 3) pci_msi_setup_msi_irqs() is invoked which ends up accessing the MSI-X table entries 4) MSI-X is enabled and masked in the control register with the comment that enabling is required for some hardware to access the MSI-X table
Step #4 obviously contradicts #3. The history of this is an issue with the NIU hardware. When #4 was introduced the table access actually happened in msix_program_entries() which was invoked after enabling and masking MSI-X.
This was changed in commit d71d6432e105 ("PCI/MSI: Kill redundant call of irq_set_msi_desc() for MSI-X interrupts") which removed the table write from msix_program_entries().
Interestingly enough nobody noticed and either NIU still works or it did not get any testing with a kernel 3.19 or later.
Nevertheless this is inconsistent and there is no reason why MSI-X can't be enabled and masked in the control register early on, i.e. move step #4 above to step #1. This preserves the NIU workaround and has no side effects on other hardware.
Fixes: d71d6432e105 ("PCI/MSI: Kill redundant call of irq_set_msi_desc() for MSI-X interrupts") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Ashok Raj ashok.raj@intel.com Reviewed-by: Marc Zyngier maz@kernel.org Acked-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.344136412@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -778,18 +778,25 @@ static int msix_capability_init(struct p u16 control; void __iomem *base;
- /* Ensure MSI-X is disabled while it is set up */ - pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0); + /* + * Some devices require MSI-X to be enabled before the MSI-X + * registers can be accessed. Mask all the vectors to prevent + * interrupts coming in before they're fully set up. + */ + pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL | + PCI_MSIX_FLAGS_ENABLE);
pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control); /* Request & Map MSI-X table region */ base = msix_map_region(dev, msix_table_size(control)); - if (!base) - return -ENOMEM; + if (!base) { + ret = -ENOMEM; + goto out_disable; + }
ret = msix_setup_entries(dev, base, entries, nvec, affd); if (ret) - return ret; + goto out_disable;
ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX); if (ret) @@ -800,14 +807,6 @@ static int msix_capability_init(struct p if (ret) goto out_free;
- /* - * Some devices require MSI-X to be enabled before we can touch the - * MSI-X registers. We need to mask all the vectors to prevent - * interrupts coming in before they're fully set up. - */ - pci_msix_clear_and_set_ctrl(dev, 0, - PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE); - msix_program_entries(dev, entries);
ret = populate_msi_sysfs(dev); @@ -842,6 +841,9 @@ out_avail: out_free: free_msi_irqs(dev);
+out_disable: + pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0); + return ret; }
Hi!
I'm sorry to report here, but 4.4 patches were not yet sent to the lists (and it may be worth correcting before release).
+++ b/drivers/pci/msi.c @@ -778,18 +778,25 @@ static int msix_capability_init(struct p
...
pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control); /* Request & Map MSI-X table region */ base = msix_map_region(dev, msix_table_size(control));
- if (!base)
return -ENOMEM;
- if (!base) {
ret = -ENOMEM;
goto out_disable;
- }
ret = msix_setup_entries(dev, base, entries, nvec, affd); if (ret)
This one is correct, and so is the version queued for 4.19, but 4.4 version (9da69496e86237e94c4ffa2a5b375a4d2ee7c482) has:
/* Request & Map MSI-X table region */ base = msix_map_region(dev, msix_table_size(control)); - if (!base) + if (!base) { return -ENOMEM; + goto out_disable; + }
ret = msix_setup_entries(dev, base, entries, nvec);
That return is misplaced, it should be ret = -ENOMEM, similar to 4.19 and newer.
Best regards, Pavel -- DENX Software Engineering GmbH, Managing Director: Wolfgang Denk HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
On Tue, Aug 17, 2021 at 09:36:55AM +0200, Pavel Machek wrote:
Hi!
I'm sorry to report here, but 4.4 patches were not yet sent to the lists (and it may be worth correcting before release).
Yes, they are known to not be complete and incorrect at the moment, others have reported this to me. I will be working on these later today, thanks.
greg k-h
On Wed, Aug 18, 2021 at 08:53:45AM +0200, Greg Kroah-Hartman wrote:
On Tue, Aug 17, 2021 at 09:36:55AM +0200, Pavel Machek wrote:
Hi!
I'm sorry to report here, but 4.4 patches were not yet sent to the lists (and it may be worth correcting before release).
Yes, they are known to not be complete and incorrect at the moment, others have reported this to me. I will be working on these later today, thanks.
Now should be all fixed up thanks to some patches sent by Thomas.
greg k-h
On Wed 2021-08-18 11:19:58, Greg Kroah-Hartman wrote:
On Wed, Aug 18, 2021 at 08:53:45AM +0200, Greg Kroah-Hartman wrote:
On Tue, Aug 17, 2021 at 09:36:55AM +0200, Pavel Machek wrote:
Hi!
I'm sorry to report here, but 4.4 patches were not yet sent to the lists (and it may be worth correcting before release).
Yes, they are known to not be complete and incorrect at the moment, others have reported this to me. I will be working on these later today, thanks.
Now should be all fixed up thanks to some patches sent by Thomas.
I can't spot a bug any more; thank you, Pavel
From: Thomas Gleixner tglx@linutronix.de
commit 7d5ec3d3612396dc6d4b76366d20ab9fc06f399f upstream.
When MSI-X is enabled the ordering of calls is:
msix_map_region(); msix_setup_entries(); pci_msi_setup_msi_irqs(); msix_program_entries();
This has a few interesting issues:
1) msix_setup_entries() allocates the MSI descriptors and initializes them except for the msi_desc:masked member which is left zero initialized.
2) pci_msi_setup_msi_irqs() allocates the interrupt descriptors and sets up the MSI interrupts which ends up in pci_write_msi_msg() unless the interrupt chip provides its own irq_write_msi_msg() function.
3) msix_program_entries() does not do what the name suggests. It solely updates the entries array (if not NULL) and initializes the masked member for each MSI descriptor by reading the hardware state and then masks the entry.
Obviously this has some issues:
1) The uninitialized masked member of msi_desc prevents the enforcement of masking the entry in pci_write_msi_msg() depending on the cached masked bit. Aside of that half initialized data is a NONO in general
2) msix_program_entries() only ensures that the actually allocated entries are masked. This is wrong as experimentation with crash testing and crash kernel kexec has shown.
This limited testing unearthed that when the production kernel had more entries in use and unmasked when it crashed and the crash kernel allocated a smaller amount of entries, then a full scan of all entries found unmasked entries which were in use in the production kernel.
This is obviously a device or emulation issue as the device reset should mask all MSI-X table entries, but obviously that's just part of the paper specification.
Cure this by:
1) Masking all table entries in hardware 2) Initializing msi_desc::masked in msix_setup_entries() 3) Removing the mask dance in msix_program_entries() 4) Renaming msix_program_entries() to msix_update_entries() to reflect the purpose of that function.
As the masking of unused entries has never been done the Fixes tag refers to a commit in: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Acked-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.403833459@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 45 +++++++++++++++++++++++++++------------------ 1 file changed, 27 insertions(+), 18 deletions(-)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -697,6 +697,7 @@ static int msix_setup_entries(struct pci { struct irq_affinity_desc *curmsk, *masks = NULL; struct msi_desc *entry; + void __iomem *addr; int ret, i; int vec_count = pci_msix_vec_count(dev);
@@ -717,6 +718,7 @@ static int msix_setup_entries(struct pci
entry->msi_attrib.is_msix = 1; entry->msi_attrib.is_64 = 1; + if (entries) entry->msi_attrib.entry_nr = entries[i].entry; else @@ -728,6 +730,10 @@ static int msix_setup_entries(struct pci entry->msi_attrib.default_irq = dev->irq; entry->mask_base = base;
+ addr = pci_msix_desc_addr(entry); + if (addr) + entry->masked = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL); + list_add_tail(&entry->list, dev_to_msi_list(&dev->dev)); if (masks) curmsk++; @@ -738,26 +744,25 @@ out: return ret; }
-static void msix_program_entries(struct pci_dev *dev, - struct msix_entry *entries) +static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries) { struct msi_desc *entry; - int i = 0; - void __iomem *desc_addr;
for_each_pci_msi_entry(entry, dev) { - if (entries) - entries[i++].vector = entry->irq; + if (entries) { + entries->vector = entry->irq; + entries++; + } + } +}
- desc_addr = pci_msix_desc_addr(entry); - if (desc_addr) - entry->masked = readl(desc_addr + - PCI_MSIX_ENTRY_VECTOR_CTRL); - else - entry->masked = 0; +static void msix_mask_all(void __iomem *base, int tsize) +{ + u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT; + int i;
- msix_mask_irq(entry, 1); - } + for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE) + writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL); }
/** @@ -774,9 +779,9 @@ static void msix_program_entries(struct static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries, int nvec, struct irq_affinity *affd) { - int ret; - u16 control; void __iomem *base; + int ret, tsize; + u16 control;
/* * Some devices require MSI-X to be enabled before the MSI-X @@ -788,12 +793,16 @@ static int msix_capability_init(struct p
pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control); /* Request & Map MSI-X table region */ - base = msix_map_region(dev, msix_table_size(control)); + tsize = msix_table_size(control); + base = msix_map_region(dev, tsize); if (!base) { ret = -ENOMEM; goto out_disable; }
+ /* Ensure that all table entries are masked. */ + msix_mask_all(base, tsize); + ret = msix_setup_entries(dev, base, entries, nvec, affd); if (ret) goto out_disable; @@ -807,7 +816,7 @@ static int msix_capability_init(struct p if (ret) goto out_free;
- msix_program_entries(dev, entries); + msix_update_entries(dev, entries);
ret = populate_msi_sysfs(dev); if (ret)
From: Thomas Gleixner tglx@linutronix.de
commit da181dc974ad667579baece33c2c8d2d1e4558d5 upstream.
The specification (PCIe r5.0, sec 6.1.4.5) states:
For MSI-X, a function is permitted to cache Address and Data values from unmasked MSI-X Table entries. However, anytime software unmasks a currently masked MSI-X Table entry either by clearing its Mask bit or by clearing the Function Mask bit, the function must update any Address or Data values that it cached from that entry. If software changes the Address or Data value of an entry while the entry is unmasked, the result is undefined.
The Linux kernel's MSI-X support never enforced that the entry is masked before the entry is modified hence the Fixes tag refers to a commit in: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Enforce the entry to be masked across the update.
There is no point in enforcing this to be handled at all possible call sites as this is just pointless code duplication and the common update function is the obvious place to enforce this.
Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support") Reported-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Acked-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.462096385@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -316,13 +316,28 @@ void __pci_write_msi_msg(struct msi_desc /* Don't touch the hardware now */ } else if (entry->msi_attrib.is_msix) { void __iomem *base = pci_msix_desc_addr(entry); + bool unmasked = !(entry->masked & PCI_MSIX_ENTRY_CTRL_MASKBIT);
if (!base) goto skip;
+ /* + * The specification mandates that the entry is masked + * when the message is modified: + * + * "If software changes the Address or Data value of an + * entry while the entry is unmasked, the result is + * undefined." + */ + if (unmasked) + __pci_msix_desc_mask_irq(entry, PCI_MSIX_ENTRY_CTRL_MASKBIT); + writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR); writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR); writel(msg->data, base + PCI_MSIX_ENTRY_DATA); + + if (unmasked) + __pci_msix_desc_mask_irq(entry, 0); } else { int pos = dev->msi_cap; u16 msgctl;
From: Thomas Gleixner tglx@linutronix.de
commit b9255a7cb51754e8d2645b65dd31805e282b4f3e upstream.
Nothing enforces the posted writes to be visible when the function returns. Flush them even if the flush might be redundant when the entry is masked already as the unmask will flush as well. This is either setup or a rare affinity change event so the extra flush is not the end of the world.
While this is more a theoretical issue especially the logic in the X86 specific msi_set_affinity() function relies on the assumption that the update has reached the hardware when the function returns.
Again, as this never has been enforced the Fixes tag refers to a commit in: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Acked-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.515188147@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -338,6 +338,9 @@ void __pci_write_msi_msg(struct msi_desc
if (unmasked) __pci_msix_desc_mask_irq(entry, 0); + + /* Ensure that the writes are visible in the device */ + readl(base + PCI_MSIX_ENTRY_DATA); } else { int pos = dev->msi_cap; u16 msgctl; @@ -358,6 +361,8 @@ void __pci_write_msi_msg(struct msi_desc pci_write_config_word(dev, pos + PCI_MSI_DATA_32, msg->data); } + /* Ensure that the writes are visible in the device */ + pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); }
skip:
From: Thomas Gleixner tglx@linutronix.de
commit 361fd37397f77578735907341579397d5bed0a2d upstream.
msi_mask_irq() takes a mask and a flags argument. The mask argument is used to mask out bits from the cached mask and the flags argument to set bits.
Some places invoke it with a flags argument which sets bits which are not used by the device, i.e. when the device supports up to 8 vectors a full unmask in some places sets the mask to 0xFFFFFF00. While devices probably do not care, it's still bad practice.
Fixes: 7ba1930db02f ("PCI MSI: Unmask MSI if setup failed") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.568173099@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -662,21 +662,21 @@ static int msi_capability_init(struct pc /* Configure MSI capability structure */ ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI); if (ret) { - msi_mask_irq(entry, mask, ~mask); + msi_mask_irq(entry, mask, 0); free_msi_irqs(dev); return ret; }
ret = msi_verify_entries(dev); if (ret) { - msi_mask_irq(entry, mask, ~mask); + msi_mask_irq(entry, mask, 0); free_msi_irqs(dev); return ret; }
ret = populate_msi_sysfs(dev); if (ret) { - msi_mask_irq(entry, mask, ~mask); + msi_mask_irq(entry, mask, 0); free_msi_irqs(dev); return ret; } @@ -961,7 +961,7 @@ static void pci_msi_shutdown(struct pci_ /* Return the device with MSI unmasked as initial states */ mask = msi_mask(desc->msi_attrib.multi_cap); /* Keep cached state to be restored */ - __pci_msi_desc_mask_irq(desc, mask, ~mask); + __pci_msi_desc_mask_irq(desc, mask, 0);
/* Restore dev->irq to its default pin-assertion IRQ */ dev->irq = desc->msi_attrib.default_irq;
From: Thomas Gleixner tglx@linutronix.de
commit 689e6b5351573c38ccf92a0dd8b3e2c2241e4aff upstream.
The comments about preserving the cached state in pci_msi[x]_shutdown() are misleading as the MSI descriptors are freed right after those functions return. So there is nothing to restore. Preparatory change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.621609423@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -960,7 +960,6 @@ static void pci_msi_shutdown(struct pci_
/* Return the device with MSI unmasked as initial states */ mask = msi_mask(desc->msi_attrib.multi_cap); - /* Keep cached state to be restored */ __pci_msi_desc_mask_irq(desc, mask, 0);
/* Restore dev->irq to its default pin-assertion IRQ */ @@ -1046,10 +1045,8 @@ static void pci_msix_shutdown(struct pci }
/* Return the device with MSI-X masked as initial states */ - for_each_pci_msi_entry(entry, dev) { - /* Keep cached states to be restored */ + for_each_pci_msi_entry(entry, dev) __pci_msix_desc_mask_irq(entry, 1); - }
pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0); pci_intx_for_msi(dev, 1);
From: Thomas Gleixner tglx@linutronix.de
commit d28d4ad2a1aef27458b3383725bb179beb8d015c upstream.
No point in using the raw write function from shutdown. Preparatory change to introduce proper serialization for the msi_desc::masked cache.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.674391354@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/msi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -960,7 +960,7 @@ static void pci_msi_shutdown(struct pci_
/* Return the device with MSI unmasked as initial states */ mask = msi_mask(desc->msi_attrib.multi_cap); - __pci_msi_desc_mask_irq(desc, mask, 0); + msi_mask_irq(desc, mask, 0);
/* Restore dev->irq to its default pin-assertion IRQ */ dev->irq = desc->msi_attrib.default_irq;
From: Thomas Gleixner tglx@linutronix.de
commit 77e89afc25f30abd56e76a809ee2884d7c1b63ce upstream.
Multi-MSI uses a single MSI descriptor and there is a single mask register when the device supports per vector masking. To avoid reading back the mask register the value is cached in the MSI descriptor and updates are done by clearing and setting bits in the cache and writing it to the device.
But nothing protects msi_desc::masked and the mask register from being modified concurrently on two different CPUs for two different Linux interrupts which belong to the same multi-MSI descriptor.
Add a lock to struct device and protect any operation on the mask and the mask register with it.
This makes the update of msi_desc::masked unconditional, but there is no place which requires a modification of the hardware register without updating the masked cache.
msi_mask_irq() is now an empty wrapper which will be cleaned up in follow up changes.
The problem goes way back to the initial support of multi-MSI, but picking the commit which introduced the mask cache is a valid cut off point (2.6.30).
Fixes: f2440d9acbe8 ("PCI MSI: Refactor interrupt masking code") Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Marc Zyngier maz@kernel.org Reviewed-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210729222542.726833414@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/base/core.c | 1 + drivers/pci/msi.c | 19 ++++++++++--------- include/linux/device.h | 1 + include/linux/msi.h | 2 +- 4 files changed, 13 insertions(+), 10 deletions(-)
--- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1722,6 +1722,7 @@ void device_initialize(struct device *de device_pm_init(dev); set_dev_node(dev, -1); #ifdef CONFIG_GENERIC_MSI_IRQ + raw_spin_lock_init(&dev->msi_lock); INIT_LIST_HEAD(&dev->msi_list); #endif INIT_LIST_HEAD(&dev->links.consumers); --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -170,24 +170,25 @@ static inline __attribute_const__ u32 ms * reliably as devices without an INTx disable bit will then generate a * level IRQ which will never be cleared. */ -u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag) +void __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag) { - u32 mask_bits = desc->masked; + raw_spinlock_t *lock = &desc->dev->msi_lock; + unsigned long flags;
if (pci_msi_ignore_mask || !desc->msi_attrib.maskbit) - return 0; + return;
- mask_bits &= ~mask; - mask_bits |= flag; + raw_spin_lock_irqsave(lock, flags); + desc->masked &= ~mask; + desc->masked |= flag; pci_write_config_dword(msi_desc_to_pci_dev(desc), desc->mask_pos, - mask_bits); - - return mask_bits; + desc->masked); + raw_spin_unlock_irqrestore(lock, flags); }
static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag) { - desc->masked = __pci_msi_desc_mask_irq(desc, mask, flag); + __pci_msi_desc_mask_irq(desc, mask, flag); }
static void __iomem *pci_msix_desc_addr(struct msi_desc *desc) --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1260,6 +1260,7 @@ struct device { struct dev_pin_info *pins; #endif #ifdef CONFIG_GENERIC_MSI_IRQ + raw_spinlock_t msi_lock; struct list_head msi_list; #endif
--- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -194,7 +194,7 @@ void __pci_read_msi_msg(struct msi_desc void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag); -u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag); +void __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag); void pci_msi_mask_irq(struct irq_data *data); void pci_msi_unmask_irq(struct irq_data *data);
From: Sean Christopherson seanjc@google.com
commit 7b9cae027ba3aaac295ae23a62f47876ed97da73 upstream.
Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to effectively get the controls for the current VMCS, as opposed to using vmx->secondary_exec_controls, which is the cached value of KVM's desired controls for vmcs01 and truly not reflective of any particular VMCS.
While the waitpkg control is not dynamic, i.e. vmcs01 will always hold the same waitpkg configuration as vmx->secondary_exec_controls, the same does not hold true for vmcs02 if the L1 VMM hides the feature from L2. If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL, L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.
Fixes: 6e3ba4abcea5 ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Message-Id: 20210810171952.2758100-2-seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/vmx/vmx.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -512,7 +512,7 @@ static inline void decache_tsc_multiplie
static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx) { - return vmx->secondary_exec_control & + return secondary_exec_controls_get(vmx) & SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE; }
From: Jeff Layton jlayton@kernel.org
commit a6862e6708c15995bc10614b2ef34ca35b4b9078 upstream.
Turn some comments into lockdep asserts.
Signed-off-by: Jeff Layton jlayton@kernel.org Reviewed-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ceph/snap.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
--- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -65,6 +65,8 @@ void ceph_get_snap_realm(struct ceph_mds_client *mdsc, struct ceph_snap_realm *realm) { + lockdep_assert_held_write(&mdsc->snap_rwsem); + dout("get_realm %p %d -> %d\n", realm, atomic_read(&realm->nref), atomic_read(&realm->nref)+1); /* @@ -113,6 +115,8 @@ static struct ceph_snap_realm *ceph_crea { struct ceph_snap_realm *realm;
+ lockdep_assert_held_write(&mdsc->snap_rwsem); + realm = kzalloc(sizeof(*realm), GFP_NOFS); if (!realm) return ERR_PTR(-ENOMEM); @@ -143,6 +147,8 @@ static struct ceph_snap_realm *__lookup_ struct rb_node *n = mdsc->snap_realms.rb_node; struct ceph_snap_realm *r;
+ lockdep_assert_held_write(&mdsc->snap_rwsem); + while (n) { r = rb_entry(n, struct ceph_snap_realm, node); if (ino < r->ino) @@ -176,6 +182,8 @@ static void __put_snap_realm(struct ceph static void __destroy_snap_realm(struct ceph_mds_client *mdsc, struct ceph_snap_realm *realm) { + lockdep_assert_held_write(&mdsc->snap_rwsem); + dout("__destroy_snap_realm %p %llx\n", realm, realm->ino);
rb_erase(&realm->node, &mdsc->snap_realms); @@ -198,6 +206,8 @@ static void __destroy_snap_realm(struct static void __put_snap_realm(struct ceph_mds_client *mdsc, struct ceph_snap_realm *realm) { + lockdep_assert_held_write(&mdsc->snap_rwsem); + dout("__put_snap_realm %llx %p %d -> %d\n", realm->ino, realm, atomic_read(&realm->nref), atomic_read(&realm->nref)-1); if (atomic_dec_and_test(&realm->nref)) @@ -236,6 +246,8 @@ static void __cleanup_empty_realms(struc { struct ceph_snap_realm *realm;
+ lockdep_assert_held_write(&mdsc->snap_rwsem); + spin_lock(&mdsc->snap_empty_lock); while (!list_empty(&mdsc->snap_empty)) { realm = list_first_entry(&mdsc->snap_empty, @@ -269,6 +281,8 @@ static int adjust_snap_realm_parent(stru { struct ceph_snap_realm *parent;
+ lockdep_assert_held_write(&mdsc->snap_rwsem); + if (realm->parent_ino == parentino) return 0;
@@ -686,6 +700,8 @@ int ceph_update_snap_trace(struct ceph_m int err = -ENOMEM; LIST_HEAD(dirty_realms);
+ lockdep_assert_held_write(&mdsc->snap_rwsem); + dout("update_snap_trace deletion=%d\n", deletion); more: ceph_decode_need(&p, e, sizeof(*ri), bad);
From: Jeff Layton jlayton@kernel.org
commit df2c0cb7f8e8c83e495260ad86df8c5da947f2a7 upstream.
They both say that the snap_rwsem must be held for write, but I don't see any real reason for it, and it's not currently always called that way.
The lookup is just walking the rbtree, so holding it for read should be fine there. The "get" is bumping the refcount and (possibly) removing it from the empty list. I see no need to hold the snap_rwsem for write for that.
Signed-off-by: Jeff Layton jlayton@kernel.org Reviewed-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ceph/snap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
--- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -60,12 +60,12 @@ /* * increase ref count for the realm * - * caller must hold snap_rwsem for write. + * caller must hold snap_rwsem. */ void ceph_get_snap_realm(struct ceph_mds_client *mdsc, struct ceph_snap_realm *realm) { - lockdep_assert_held_write(&mdsc->snap_rwsem); + lockdep_assert_held(&mdsc->snap_rwsem);
dout("get_realm %p %d -> %d\n", realm, atomic_read(&realm->nref), atomic_read(&realm->nref)+1); @@ -139,7 +139,7 @@ static struct ceph_snap_realm *ceph_crea /* * lookup the realm rooted at @ino. * - * caller must hold snap_rwsem for write. + * caller must hold snap_rwsem. */ static struct ceph_snap_realm *__lookup_snap_realm(struct ceph_mds_client *mdsc, u64 ino) @@ -147,7 +147,7 @@ static struct ceph_snap_realm *__lookup_ struct rb_node *n = mdsc->snap_realms.rb_node; struct ceph_snap_realm *r;
- lockdep_assert_held_write(&mdsc->snap_rwsem); + lockdep_assert_held(&mdsc->snap_rwsem);
while (n) { r = rb_entry(n, struct ceph_snap_realm, node);
From: Jeff Layton jlayton@kernel.org
commit 8434ffe71c874b9c4e184b88d25de98c2bf5fe3f upstream.
There is a race in ceph_put_snap_realm. The change to the nref and the spinlock acquisition are not done atomically, so you could decrement nref, and before you take the spinlock, the nref is incremented again. At that point, you end up putting it on the empty list when it shouldn't be there. Eventually __cleanup_empty_realms runs and frees it when it's still in-use.
Fix this by protecting the 1->0 transition with atomic_dec_and_lock, and just drop the spinlock if we can get the rwsem.
Because these objects can also undergo a 0->1 refcount transition, we must protect that change as well with the spinlock. Increment locklessly unless the value is at 0, in which case we take the spinlock, increment and then take it off the empty list if it did the 0->1 transition.
With these changes, I'm removing the dout() messages from these functions, as well as in __put_snap_realm. They've always been racy, and it's better to not print values that may be misleading.
Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46419 Reported-by: Mark Nelson mnelson@redhat.com Signed-off-by: Jeff Layton jlayton@kernel.org Reviewed-by: Luis Henriques lhenriques@suse.de Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ceph/snap.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-)
--- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -67,19 +67,19 @@ void ceph_get_snap_realm(struct ceph_mds { lockdep_assert_held(&mdsc->snap_rwsem);
- dout("get_realm %p %d -> %d\n", realm, - atomic_read(&realm->nref), atomic_read(&realm->nref)+1); /* - * since we _only_ increment realm refs or empty the empty - * list with snap_rwsem held, adjusting the empty list here is - * safe. we do need to protect against concurrent empty list - * additions, however. + * The 0->1 and 1->0 transitions must take the snap_empty_lock + * atomically with the refcount change. Go ahead and bump the + * nref here, unless it's 0, in which case we take the spinlock + * and then do the increment and remove it from the list. */ - if (atomic_inc_return(&realm->nref) == 1) { - spin_lock(&mdsc->snap_empty_lock); + if (atomic_inc_not_zero(&realm->nref)) + return; + + spin_lock(&mdsc->snap_empty_lock); + if (atomic_inc_return(&realm->nref) == 1) list_del_init(&realm->empty_item); - spin_unlock(&mdsc->snap_empty_lock); - } + spin_unlock(&mdsc->snap_empty_lock); }
static void __insert_snap_realm(struct rb_root *root, @@ -208,28 +208,28 @@ static void __put_snap_realm(struct ceph { lockdep_assert_held_write(&mdsc->snap_rwsem);
- dout("__put_snap_realm %llx %p %d -> %d\n", realm->ino, realm, - atomic_read(&realm->nref), atomic_read(&realm->nref)-1); + /* + * We do not require the snap_empty_lock here, as any caller that + * increments the value must hold the snap_rwsem. + */ if (atomic_dec_and_test(&realm->nref)) __destroy_snap_realm(mdsc, realm); }
/* - * caller needn't hold any locks + * See comments in ceph_get_snap_realm. Caller needn't hold any locks. */ void ceph_put_snap_realm(struct ceph_mds_client *mdsc, struct ceph_snap_realm *realm) { - dout("put_snap_realm %llx %p %d -> %d\n", realm->ino, realm, - atomic_read(&realm->nref), atomic_read(&realm->nref)-1); - if (!atomic_dec_and_test(&realm->nref)) + if (!atomic_dec_and_lock(&realm->nref, &mdsc->snap_empty_lock)) return;
if (down_write_trylock(&mdsc->snap_rwsem)) { + spin_unlock(&mdsc->snap_empty_lock); __destroy_snap_realm(mdsc, realm); up_write(&mdsc->snap_rwsem); } else { - spin_lock(&mdsc->snap_empty_lock); list_add(&realm->empty_item, &mdsc->snap_empty); spin_unlock(&mdsc->snap_empty_lock); }
From: Nathan Chancellor nathan@kernel.org
commit 848378812e40152abe9b9baf58ce2004f76fb988 upstream.
A recent change in LLVM causes module_{c,d}tor sections to appear when CONFIG_K{A,C}SAN are enabled, which results in orphan section warnings because these are not handled anywhere:
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_ctor) is being placed in '.text.asan.module_ctor' ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_dtor) is being placed in '.text.asan.module_dtor' ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.tsan.module_ctor) is being placed in '.text.tsan.module_ctor'
Fangrui explains: "the function asan.module_ctor has the SHF_GNU_RETAIN flag, so it is in a separate section even with -fno-function-sections (default)".
Place them in the TEXT_TEXT section so that these technologies continue to work with the newer compiler versions. All of the KASAN and KCSAN KUnit tests continue to pass after this change.
Cc: stable@vger.kernel.org Link: https://github.com/ClangBuiltLinux/linux/issues/1432 Link: https://github.com/llvm/llvm-project/commit/7b789562244ee941b7bf2cefeb3fc08a... Signed-off-by: Nathan Chancellor nathan@kernel.org Reviewed-by: Nick Desaulniers ndesaulniers@google.com Reviewed-by: Fangrui Song maskray@google.com Acked-by: Marco Elver elver@google.com Signed-off-by: Kees Cook keescook@chromium.org Link: https://lore.kernel.org/r/20210731023107.1932981-1-nathan@kernel.org [nc: Resolve conflict due to lack of cf68fffb66d60] Signed-off-by: Nathan Chancellor nathan@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/asm-generic/vmlinux.lds.h | 1 + 1 file changed, 1 insertion(+)
--- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -536,6 +536,7 @@ NOINSTR_TEXT \ *(.text..refcount) \ *(.ref.text) \ + *(.text.asan.* .text.tsan.*) \ MEM_KEEP(init.text*) \ MEM_KEEP(exit.text*) \
From: Saeed Mirzamohammadi saeed.mirzamohammadi@oracle.com
[ Upstream commit 327d5b2fee91c404a3956c324193892cf2cc9528 ]
The IOMMU driver calculates the guest addressability for a DMA request based on the value of the mgaw reported from the IOMMU. However, this is a fused value and as mentioned in the spec, the guest width should be calculated based on the minimum of supported adjusted guest address width (SAGAW) and MGAW.
This is from specification: "Guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure. (Adjusted guest address widths supported by hardware are reported through the SAGAW field)."
This causes domain initialization to fail and following errors appear for EHCI PCI driver:
[ 2.486393] ehci-pci 0000:01:00.4: EHCI Host Controller [ 2.486624] ehci-pci 0000:01:00.4: new USB bus registered, assigned bus number 1 [ 2.489127] ehci-pci 0000:01:00.4: DMAR: Allocating domain failed [ 2.489350] ehci-pci 0000:01:00.4: DMAR: 32bit DMA uses non-identity mapping [ 2.489359] ehci-pci 0000:01:00.4: can't setup: -12 [ 2.489531] ehci-pci 0000:01:00.4: USB bus 1 deregistered [ 2.490023] ehci-pci 0000:01:00.4: init 0000:01:00.4 fail, -12 [ 2.490358] ehci-pci: probe of 0000:01:00.4 failed with error -12
This issue happens when the value of the sagaw corresponds to a 48-bit agaw. This fix updates the calculation of the agaw based on the minimum of IOMMU's sagaw value and MGAW.
This issue happens on the code path of getting a private domain for a device. A private domain was needed when the domain of an iommu group couldn't meet the requirement of a device. The IOMMU core has been evolved to eliminate the need for private domain, hence this code path has alreay been removed from the upstream since commit 327d5b2fee91c ("iommu/vt-d: Allow 32bit devices to uses DMA domain"). Instead of back porting all patches that are required for removing the private domain, this simply fixes it in the affected stable kernel between v4.16 and v5.7.
[baolu: The orignal patch could be found here https://lore.kernel.org/linux-iommu/20210412202736.70765-1-saeed.mirzamohamm.... I added commit message according to Greg's comments at https://lore.kernel.org/linux-iommu/YHZ%2FT9x7Xjf1r6fI@kroah.com/.]
Cc: Joerg Roedel joro@8bytes.org Cc: Ashok Raj ashok.raj@intel.com Cc: stable@vger.kernel.org #v4.16+ Signed-off-by: Saeed Mirzamohammadi saeed.mirzamohammadi@oracle.com Tested-by: Camille Lu camille.lu@hpe.com Signed-off-by: Lu Baolu baolu.lu@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/iommu/intel-iommu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
--- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1853,7 +1853,7 @@ static inline int guestwidth_to_adjustwi static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu, int guest_width) { - int adjust_width, agaw; + int adjust_width, agaw, cap_width; unsigned long sagaw; int err;
@@ -1867,8 +1867,9 @@ static int domain_init(struct dmar_domai domain_reserve_special_ranges(domain);
/* calculate AGAW */ - if (guest_width > cap_mgaw(iommu->cap)) - guest_width = cap_mgaw(iommu->cap); + cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw)); + if (guest_width > cap_width) + guest_width = cap_width; domain->gaw = guest_width; adjust_width = guestwidth_to_adjustwidth(guest_width); agaw = width_to_agaw(adjust_width);
linux-stable-mirror@lists.linaro.org