From: Jean Delvare jdelvare@suse.de
[ Upstream commit 3f9e12e0df012c4a9a7fd7eb0d3ae69b459d6b2c ]
In case the WDAT interface is broken, give the user an option to ignore it to let a native driver bind to the watchdog device instead.
Signed-off-by: Jean Delvare jdelvare@suse.de Acked-by: Mika Westerberg mika.westerberg@linux.intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- Documentation/admin-guide/kernel-parameters.txt | 4 ++++ drivers/acpi/acpi_watchdog.c | 12 +++++++++++- 2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ade4e6ec23e03..727a03fb26c99 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -136,6 +136,10 @@ dynamic table installation which will install SSDT tables to /sys/firmware/acpi/tables/dynamic.
+ acpi_no_watchdog [HW,ACPI,WDT] + Ignore the ACPI-based watchdog interface (WDAT) and let + a native driver control the watchdog device instead. + acpi_rsdp= [ACPI,EFI,KEXEC] Pass the RSDP address to the kernel, mostly used on machines running EFI runtime service to boot the diff --git a/drivers/acpi/acpi_watchdog.c b/drivers/acpi/acpi_watchdog.c index b5516b04ffc07..ab6e434b4cee0 100644 --- a/drivers/acpi/acpi_watchdog.c +++ b/drivers/acpi/acpi_watchdog.c @@ -55,12 +55,14 @@ static bool acpi_watchdog_uses_rtc(const struct acpi_table_wdat *wdat) } #endif
+static bool acpi_no_watchdog; + static const struct acpi_table_wdat *acpi_watchdog_get_wdat(void) { const struct acpi_table_wdat *wdat = NULL; acpi_status status;
- if (acpi_disabled) + if (acpi_disabled || acpi_no_watchdog) return NULL;
status = acpi_get_table(ACPI_SIG_WDAT, 0, @@ -88,6 +90,14 @@ bool acpi_has_watchdog(void) } EXPORT_SYMBOL_GPL(acpi_has_watchdog);
+/* ACPI watchdog can be disabled on boot command line */ +static int __init disable_acpi_watchdog(char *str) +{ + acpi_no_watchdog = true; + return 1; +} +__setup("acpi_no_watchdog", disable_acpi_watchdog); + void __init acpi_watchdog_init(void) { const struct acpi_wdat_entry *entries;
From: Mansour Behabadi mansour@oxplot.com
[ Upstream commit e433be929e63265b7412478eb7ff271467aee2d7 ]
Magic Keyboards with more recent firmware (0x0100) report Fn key differently. Without this patch, Fn key may not behave as expected and may not be configurable via hid_apple fnmode module parameter.
Signed-off-by: Mansour Behabadi mansour@oxplot.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/hid-apple.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/hid/hid-apple.c b/drivers/hid/hid-apple.c index 6ac8becc2372e..d732d1d10cafb 100644 --- a/drivers/hid/hid-apple.c +++ b/drivers/hid/hid-apple.c @@ -340,7 +340,8 @@ static int apple_input_mapping(struct hid_device *hdev, struct hid_input *hi, unsigned long **bit, int *max) { if (usage->hid == (HID_UP_CUSTOM | 0x0003) || - usage->hid == (HID_UP_MSVENDOR | 0x0003)) { + usage->hid == (HID_UP_MSVENDOR | 0x0003) || + usage->hid == (HID_UP_HPVENDOR2 | 0x0003)) { /* The fn key on Apple USB keyboards */ set_bit(EV_REP, hi->input->evbit); hid_map_usage_clear(hi, usage, bit, max, EV_KEY, KEY_FN);
From: Johan Korsnes jkorsnes@cisco.com
[ Upstream commit 5ebdffd25098898aff1249ae2f7dbfddd76d8f8f ]
In case a report is greater than HID_MAX_BUFFER_SIZE, it is truncated, but the report-number byte is not correctly handled. This results in a off-by-one in the following memset, causing a kernel Oops and ensuing system crash.
Note: With commit 8ec321e96e05 ("HID: Fix slab-out-of-bounds read in hid_field_extract") I no longer hit the kernel Oops as we instead fail "controlled" at probe if there is a report too long in the HID report-descriptor. hid_report_raw_event() is an exported symbol, so presumabely we cannot always rely on this being the case.
Fixes: 966922f26c7f ("HID: fix a crash in hid_report_raw_event() function.") Signed-off-by: Johan Korsnes jkorsnes@cisco.com Cc: Armando Visconti armando.visconti@st.com Cc: Jiri Kosina jkosina@suse.cz Cc: Alan Stern stern@rowland.harvard.edu Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/hid-core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c index 851fe54ea59e7..359616e3efbbb 100644 --- a/drivers/hid/hid-core.c +++ b/drivers/hid/hid-core.c @@ -1741,7 +1741,9 @@ int hid_report_raw_event(struct hid_device *hid, int type, u8 *data, u32 size,
rsize = ((report->size - 1) >> 3) + 1;
- if (rsize > HID_MAX_BUFFER_SIZE) + if (report_enum->numbered && rsize >= HID_MAX_BUFFER_SIZE) + rsize = HID_MAX_BUFFER_SIZE - 1; + else if (rsize > HID_MAX_BUFFER_SIZE) rsize = HID_MAX_BUFFER_SIZE;
if (csize < rsize) {
From: Johan Korsnes jkorsnes@cisco.com
[ Upstream commit 84a4062632462c4320704fcdf8e99e89e94c0aba ]
We have a HID touch device that reports its opens and shorts test results in HID buffers of size 8184 bytes. The maximum size of the HID buffer is currently set to 4096 bytes, causing probe of this device to fail. With this patch we increase the maximum size of the HID buffer to 8192 bytes, making device probe and acquisition of said buffers succeed.
Signed-off-by: Johan Korsnes jkorsnes@cisco.com Cc: Alan Stern stern@rowland.harvard.edu Cc: Armando Visconti armando.visconti@st.com Cc: Jiri Kosina jkosina@suse.cz Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/hid.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/hid.h b/include/linux/hid.h index cd41f209043f6..875f71132b142 100644 --- a/include/linux/hid.h +++ b/include/linux/hid.h @@ -492,7 +492,7 @@ struct hid_report_enum { };
#define HID_MIN_BUFFER_SIZE 64 /* make sure there is at least a packet size of space */ -#define HID_MAX_BUFFER_SIZE 4096 /* 4kb */ +#define HID_MAX_BUFFER_SIZE 8192 /* 8kb */ #define HID_CONTROL_FIFO_SIZE 256 /* to init devices with >100 reports */ #define HID_OUTPUT_FIFO_SIZE 64
From: "dan.carpenter@oracle.com" dan.carpenter@oracle.com
[ Upstream commit 5c02c447eaeda29d3da121a2e17b97ccaf579b51 ]
Syzbot reports that "hiddev" is used after it's free in hiddev_disconnect(). The hiddev_disconnect() function sets "hiddev->exist = 0;" so hiddev_release() can free it as soon as we drop the "existancelock" lock. This patch moves the mutex_unlock(&hiddev->existancelock) until after we have finished using it.
Reported-by: syzbot+784ccb935f9900cc7c9e@syzkaller.appspotmail.com Fixes: 7f77897ef2b6 ("HID: hiddev: fix potential use-after-free") Suggested-by: Alan Stern stern@rowland.harvard.edu Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/usbhid/hiddev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hid/usbhid/hiddev.c b/drivers/hid/usbhid/hiddev.c index a970b809d778c..4140dea693e90 100644 --- a/drivers/hid/usbhid/hiddev.c +++ b/drivers/hid/usbhid/hiddev.c @@ -932,9 +932,9 @@ void hiddev_disconnect(struct hid_device *hid) hiddev->exist = 0;
if (hiddev->open) { - mutex_unlock(&hiddev->existancelock); hid_hw_close(hiddev->hid); wake_up_interruptible(&hiddev->wait); + mutex_unlock(&hiddev->existancelock); } else { mutex_unlock(&hiddev->existancelock); kfree(hiddev);
From: Christophe JAILLET christophe.jaillet@wanadoo.fr
[ Upstream commit 8d2e77b39b8fecb794e19cd006a12f90b14dd077 ]
They are issues: - if 'input_allocate_device()' fails and return NULL, there is no need to free anything and 'input_free_device()' call is a no-op. It can be axed. - 'ret' is known to be 0 at this point, so we must set it to a meaningful value before returning
Fixes: 2562756dde55 ("HID: add Alps I2C HID Touchpad-Stick support") Signed-off-by: Christophe JAILLET christophe.jaillet@wanadoo.fr Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/hid-alps.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hid/hid-alps.c b/drivers/hid/hid-alps.c index ae79a7c667372..fa704153cb00d 100644 --- a/drivers/hid/hid-alps.c +++ b/drivers/hid/hid-alps.c @@ -730,7 +730,7 @@ static int alps_input_configured(struct hid_device *hdev, struct hid_input *hi) if (data->has_sp) { input2 = input_allocate_device(); if (!input2) { - input_free_device(input2); + ret = -ENOMEM; goto exit; }
From: "Gustavo A. R. Silva" gustavo@embeddedor.com
[ Upstream commit 54498e8070e19e74498a72c7331348143e7e1f8c ]
Factor out 100 from the equation and do 32-bit arithmetic (3 * clk_mhz / 10) instead of 64-bit.
Notice that clk_mhz is MHz, so the multiplication will never wrap 32 bits and there is no need for div_u64().
Addresses-Coverity: 1458369 ("Unintentional integer overflow") Fixes: 0560ad576268 ("i2c: altera: Add Altera I2C Controller driver") Suggested-by: David Laight David.Laight@ACULAB.COM Signed-off-by: Gustavo A. R. Silva gustavo@embeddedor.com Reviewed-by: Thor Thayer thor.thayer@linux.intel.com Signed-off-by: Wolfram Sang wsa@the-dreams.de Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/i2c/busses/i2c-altera.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/i2c/busses/i2c-altera.c b/drivers/i2c/busses/i2c-altera.c index 5255d3755411b..1de23b4f3809c 100644 --- a/drivers/i2c/busses/i2c-altera.c +++ b/drivers/i2c/busses/i2c-altera.c @@ -171,7 +171,7 @@ static void altr_i2c_init(struct altr_i2c_dev *idev) /* SCL Low Time */ writel(t_low, idev->base + ALTR_I2C_SCL_LOW); /* SDA Hold Time, 300ns */ - writel(div_u64(300 * clk_mhz, 1000), idev->base + ALTR_I2C_SDA_HOLD); + writel(3 * clk_mhz / 10, idev->base + ALTR_I2C_SDA_HOLD);
/* Mask all master interrupt bits */ altr_i2c_int_enable(idev, ALTR_I2C_ALL_IRQ, false);
From: Mika Westerberg mika.westerberg@linux.intel.com
[ Upstream commit cabe17d0173ab04bd3f87b8199ae75f43f1ea473 ]
If the BIOS default timeout for the watchdog is too small userspace may not have enough time to configure new timeout after opening the device before the system is already reset. For this reason program default timeout of 30 seconds in the driver probe and allow userspace to change this from command line or through module parameter (wdat_wdt.timeout).
Reported-by: Jean Delvare jdelvare@suse.de Signed-off-by: Mika Westerberg mika.westerberg@linux.intel.com Reviewed-by: Jean Delvare jdelvare@suse.de Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/watchdog/wdat_wdt.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+)
diff --git a/drivers/watchdog/wdat_wdt.c b/drivers/watchdog/wdat_wdt.c index b069349b52f55..85d6f16d95cca 100644 --- a/drivers/watchdog/wdat_wdt.c +++ b/drivers/watchdog/wdat_wdt.c @@ -54,6 +54,13 @@ module_param(nowayout, bool, 0); MODULE_PARM_DESC(nowayout, "Watchdog cannot be stopped once started (default=" __MODULE_STRING(WATCHDOG_NOWAYOUT) ")");
+#define WDAT_DEFAULT_TIMEOUT 30 + +static int timeout = WDAT_DEFAULT_TIMEOUT; +module_param(timeout, int, 0); +MODULE_PARM_DESC(timeout, "Watchdog timeout in seconds (default=" + __MODULE_STRING(WDAT_DEFAULT_TIMEOUT) ")"); + static int wdat_wdt_read(struct wdat_wdt *wdat, const struct wdat_instruction *instr, u32 *value) { @@ -438,6 +445,22 @@ static int wdat_wdt_probe(struct platform_device *pdev)
platform_set_drvdata(pdev, wdat);
+ /* + * Set initial timeout so that userspace has time to configure the + * watchdog properly after it has opened the device. In some cases + * the BIOS default is too short and causes immediate reboot. + */ + if (timeout * 1000 < wdat->wdd.min_hw_heartbeat_ms || + timeout * 1000 > wdat->wdd.max_hw_heartbeat_ms) { + dev_warn(dev, "Invalid timeout %d given, using %d\n", + timeout, WDAT_DEFAULT_TIMEOUT); + timeout = WDAT_DEFAULT_TIMEOUT; + } + + ret = wdat_wdt_set_timeout(&wdat->wdd, timeout); + if (ret) + return ret; + watchdog_set_nowayout(&wdat->wdd, nowayout); return devm_watchdog_register_device(dev, &wdat->wdd); }
From: Kai-Heng Feng kai.heng.feng@canonical.com
[ Upstream commit be0aba826c4a6ba5929def1962a90d6127871969 ]
The Surfbook E11B uses the SIPODEV SP1064 touchpad, which does not supply descriptors, so it has to be added to the override list.
BugLink: https://bugs.launchpad.net/bugs/1858299 Signed-off-by: Kai-Heng Feng kai.heng.feng@canonical.com Reviewed-by: Hans de Goede hdegoede@redhat.com Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c b/drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c index d31ea82b84c17..a66f08041a1aa 100644 --- a/drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c +++ b/drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c @@ -341,6 +341,14 @@ static const struct dmi_system_id i2c_hid_dmi_desc_override_table[] = { }, .driver_data = (void *)&sipodev_desc }, + { + .ident = "Trekstor SURFBOOK E11B", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "TREKSTOR"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "SURFBOOK E11B"), + }, + .driver_data = (void *)&sipodev_desc + }, { .ident = "Direkt-Tek DTLAPY116-2", .matches = {
From: Victor Kamensky kamensky@cisco.com
[ Upstream commit d3f703c4359ff06619b2322b91f69710453e6b6d ]
Observed that when kernel is built with Yocto mips64-poky-linux-gcc, and mips64-poky-linux-gnun32-gcc toolchain, resulting vdso contains 'jalr t9' instructions in its code and since in vdso case nobody sets GOT table code crashes when instruction reached. On other hand observed that when kernel is built mips-poky-linux-gcc toolchain, the same 'jalr t9' instruction are replaced with PC relative function calls using 'bal' instructions.
The difference boils down to -mrelax-pic-calls and -mexplicit-relocs gcc options that gets different default values depending on gcc target triplets and corresponding binutils. -mrelax-pic-calls got enabled by default only in mips-poky-linux-gcc case. MIPS binutils ld relies on R_MIPS_JALR relocation to convert 'jalr t9' into 'bal' and such relocation is generated only if -mrelax-pic-calls option is on.
Please note 'jalr t9' conversion to 'bal' can happen only to static functions. These static PIC calls use mips local GOT entries that are supposed to be filled with start of DSO value by run-time linker (missing in VDSO case) and they do not have dynamic relocations. Global mips GOT entries must have dynamic relocations and they should be prevented by cmd_vdso_check Makefile rule.
Solution call out -mrelax-pic-calls and -mexplicit-relocs options explicitly while compiling MIPS vdso code. That would get correct and consistent between different toolchains behaviour.
Reported-by: Bruce Ashfield bruce.ashfield@gmail.com Signed-off-by: Victor Kamensky kamensky@cisco.com Signed-off-by: Paul Burton paulburton@kernel.org Cc: linux-mips@vger.kernel.org Cc: Ralf Baechle ralf@linux-mips.org Cc: James Hogan jhogan@kernel.org Cc: Vincenzo Frascino vincenzo.frascino@arm.com Cc: richard.purdie@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/mips/vdso/Makefile | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/mips/vdso/Makefile b/arch/mips/vdso/Makefile index e05938997e696..96afd73c94e8a 100644 --- a/arch/mips/vdso/Makefile +++ b/arch/mips/vdso/Makefile @@ -29,6 +29,7 @@ endif cflags-vdso := $(ccflags-vdso) \ $(filter -W%,$(filter-out -Wa$(comma)%,$(KBUILD_CFLAGS))) \ -O3 -g -fPIC -fno-strict-aliasing -fno-common -fno-builtin -G 0 \ + -mrelax-pic-calls -mexplicit-relocs \ -fno-stack-protector -fno-jump-tables -DDISABLE_BRANCH_PROFILING \ $(call cc-option, -fno-asynchronous-unwind-tables) \ $(call cc-option, -fno-stack-protector)
From: Paul Burton paulburton@kernel.org
[ Upstream commit 07015d7a103c4420b69a287b8ef4d2535c0f4106 ]
A check we're about to add to pick up on function calls that depend on bogus use of the GOT in the VDSO picked up on instances of such function calls in microMIPS builds. Since the code appears genuinely problematic, and given the relatively small amount of use & testing that microMIPS sees, go ahead & disable the VDSO for microMIPS builds.
Signed-off-by: Paul Burton paulburton@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/mips/vdso/Makefile | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/arch/mips/vdso/Makefile b/arch/mips/vdso/Makefile index 96afd73c94e8a..e8585a22b925c 100644 --- a/arch/mips/vdso/Makefile +++ b/arch/mips/vdso/Makefile @@ -48,6 +48,8 @@ endif
CFLAGS_REMOVE_vgettimeofday.o = -pg
+DISABLE_VDSO := n + # # For the pre-R6 code in arch/mips/vdso/vdso.h for locating # the base address of VDSO, the linker will emit a R_MIPS_PC32 @@ -61,11 +63,24 @@ CFLAGS_REMOVE_vgettimeofday.o = -pg ifndef CONFIG_CPU_MIPSR6 ifeq ($(call ld-ifversion, -lt, 225000000, y),y) $(warning MIPS VDSO requires binutils >= 2.25) - obj-vdso-y := $(filter-out vgettimeofday.o, $(obj-vdso-y)) - ccflags-vdso += -DDISABLE_MIPS_VDSO + DISABLE_VDSO := y endif endif
+# +# GCC (at least up to version 9.2) appears to emit function calls that make use +# of the GOT when targeting microMIPS, which we can't use in the VDSO due to +# the lack of relocations. As such, we disable the VDSO for microMIPS builds. +# +ifdef CONFIG_CPU_MICROMIPS + DISABLE_VDSO := y +endif + +ifeq ($(DISABLE_VDSO),y) + obj-vdso-y := $(filter-out vgettimeofday.o, $(obj-vdso-y)) + ccflags-vdso += -DDISABLE_MIPS_VDSO +endif + # VDSO linker flags. VDSO_LDFLAGS := \ -Wl,-Bsymbolic -Wl,--no-undefined -Wl,-soname=linux-vdso.so.1 \
From: Victor Kamensky kamensky@cisco.com
[ Upstream commit 976c23af3ee5bd3447a7bfb6c356ceb4acf264a6 ]
vdso shared object cannot have GOT based PIC 'jalr t9' calls because nobody set GOT table in vdso. Contributing into vdso .o files are compiled in PIC mode and as result for internal static functions calls compiler will generate 'jalr t9' instructions. Those are supposed to be converted into PC relative 'bal' calls by linker when relocation are processed.
Mips global GOT entries do have dynamic relocations and they will be caught by cmd_vdso_check Makefile rule. Static PIC calls go through mips local GOT entries that do not have dynamic relocations. For those 'jalr t9' calls could be present but without dynamic relocations and they need to be converted to 'bal' calls by linker.
Add additional build time check to make sure that no 'jalr t9' slip through because of some toolchain misconfiguration that prevents 'jalr t9' to 'bal' conversion.
Signed-off-by: Victor Kamensky kamensky@cisco.com Signed-off-by: Paul Burton paulburton@kernel.org Cc: linux-mips@vger.kernel.org Cc: Ralf Baechle ralf@linux-mips.org Cc: James Hogan jhogan@kernel.org Cc: Vincenzo Frascino vincenzo.frascino@arm.com Cc: bruce.ashfield@gmail.com Cc: richard.purdie@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/mips/vdso/Makefile | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/mips/vdso/Makefile b/arch/mips/vdso/Makefile index e8585a22b925c..bfb65b2d57c7f 100644 --- a/arch/mips/vdso/Makefile +++ b/arch/mips/vdso/Makefile @@ -93,12 +93,18 @@ GCOV_PROFILE := n UBSAN_SANITIZE := n KCOV_INSTRUMENT := n
+# Check that we don't have PIC 'jalr t9' calls left +quiet_cmd_vdso_mips_check = VDSOCHK $@ + cmd_vdso_mips_check = if $(OBJDUMP) --disassemble $@ | egrep -h "jalr.*t9" > /dev/null; \ + then (echo >&2 "$@: PIC 'jalr t9' calls are not supported"; \ + rm -f $@; /bin/false); fi + # # Shared build commands. #
quiet_cmd_vdsold_and_vdso_check = LD $@ - cmd_vdsold_and_vdso_check = $(cmd_vdsold); $(cmd_vdso_check) + cmd_vdsold_and_vdso_check = $(cmd_vdsold); $(cmd_vdso_check); $(cmd_vdso_mips_check)
quiet_cmd_vdsold = VDSO $@ cmd_vdsold = $(CC) $(c_flags) $(VDSO_LDFLAGS) \
From: Mark Tomlinson mark.tomlinson@alliedtelesis.co.nz
[ Upstream commit 97e914b7de3c943011779b979b8093fdc0d85722 ]
The Cavium Octeon CPU uses a special sync instruction for implementing wmb, and due to a CPU bug, the instruction must appear twice. A macro had been defined to hide this:
#define __SYNC_rpt(type) (1 + (type == __SYNC_wmb))
which was intended to evaluate to 2 for __SYNC_wmb, and 1 for any other type of sync. However, this expression is evaluated by the assembler, and not the compiler, and the result of '==' in the assembler is 0 or -1, not 0 or 1 as it is in C. The net result was wmb() producing no code at all. The simple fix in this patch is to change the '+' to '-'.
Fixes: bf92927251b3 ("MIPS: barrier: Add __SYNC() infrastructure") Signed-off-by: Mark Tomlinson mark.tomlinson@alliedtelesis.co.nz Tested-by: Chris Packham chris.packham@alliedtelesis.co.nz Signed-off-by: Paul Burton paulburton@kernel.org Cc: linux-mips@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/mips/include/asm/sync.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/mips/include/asm/sync.h b/arch/mips/include/asm/sync.h index 7c6a1095f5562..aabd097933fe9 100644 --- a/arch/mips/include/asm/sync.h +++ b/arch/mips/include/asm/sync.h @@ -155,9 +155,11 @@ * effective barrier as noted by commit 6b07d38aaa52 ("MIPS: Octeon: Use * optimized memory barrier primitives."). Here we specify that the affected * sync instructions should be emitted twice. + * Note that this expression is evaluated by the assembler (not the compiler), + * and that the assembler evaluates '==' as 0 or -1, not 0 or 1. */ #ifdef CONFIG_CPU_CAVIUM_OCTEON -# define __SYNC_rpt(type) (1 + (type == __SYNC_wmb)) +# define __SYNC_rpt(type) (1 - (type == __SYNC_wmb)) #else # define __SYNC_rpt(type) 1 #endif
From: Christophe JAILLET christophe.jaillet@wanadoo.fr
[ Upstream commit bef8e2dfceed6daeb6ca3e8d33f9c9d43b926580 ]
Pointer on the memory allocated by 'alloc_progmem()' is stored in 'v->load_addr'. So this is this memory that should be freed by 'release_progmem()'.
'release_progmem()' is only a call to 'kfree()'.
With the current code, there is both a double free and a memory leak. Fix it by passing the correct pointer to 'release_progmem()'.
Fixes: e01402b115ccc ("More AP / SP bits for the 34K, the Malta bits and things. Still wants") Signed-off-by: Christophe JAILLET christophe.jaillet@wanadoo.fr Signed-off-by: Paul Burton paulburton@kernel.org Cc: ralf@linux-mips.org Cc: linux-mips@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: kernel-janitors@vger.kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/mips/kernel/vpe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/mips/kernel/vpe.c b/arch/mips/kernel/vpe.c index 6176b9acba950..d0d832ab3d3b8 100644 --- a/arch/mips/kernel/vpe.c +++ b/arch/mips/kernel/vpe.c @@ -134,7 +134,7 @@ void release_vpe(struct vpe *v) { list_del(&v->list); if (v->load_addr) - release_progmem(v); + release_progmem(v->load_addr); kfree(v); }
From: Hanno Zulla kontakt@hanno.de
[ Upstream commit 789a2c250340666220fa74bc6c8f58497e3863b3 ]
The struct *bigben was allocated via devm_kzalloc() and then used as a parameter in input_ff_create_memless(). This caused a double kfree during removal of the device, since both the managed resource API and ml_ff_destroy() in drivers/input/ff-memless.c would call kfree() on it.
Signed-off-by: Hanno Zulla kontakt@hanno.de Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/hid-bigbenff.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/hid/hid-bigbenff.c b/drivers/hid/hid-bigbenff.c index 3f6abd190df43..f7e85bacb6889 100644 --- a/drivers/hid/hid-bigbenff.c +++ b/drivers/hid/hid-bigbenff.c @@ -220,10 +220,16 @@ static void bigben_worker(struct work_struct *work) static int hid_bigben_play_effect(struct input_dev *dev, void *data, struct ff_effect *effect) { - struct bigben_device *bigben = data; + struct hid_device *hid = input_get_drvdata(dev); + struct bigben_device *bigben = hid_get_drvdata(hid); u8 right_motor_on; u8 left_motor_force;
+ if (!bigben) { + hid_err(hid, "no device data\n"); + return 0; + } + if (effect->type != FF_RUMBLE) return 0;
@@ -341,7 +347,7 @@ static int bigben_probe(struct hid_device *hid,
INIT_WORK(&bigben->worker, bigben_worker);
- error = input_ff_create_memless(hidinput->input, bigben, + error = input_ff_create_memless(hidinput->input, NULL, hid_bigben_play_effect); if (error) return error;
From: Hanno Zulla kontakt@hanno.de
[ Upstream commit 976a54d0f4202cb412a3b1fc7f117e1d97db35f3 ]
It's required to call hid_hw_stop() once hid_hw_start() was called previously, so error cases need to handle this. Also, hid_hw_close() is not necessary during removal.
Signed-off-by: Hanno Zulla kontakt@hanno.de Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/hid-bigbenff.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/hid/hid-bigbenff.c b/drivers/hid/hid-bigbenff.c index f7e85bacb6889..f8c552b64a899 100644 --- a/drivers/hid/hid-bigbenff.c +++ b/drivers/hid/hid-bigbenff.c @@ -305,7 +305,6 @@ static void bigben_remove(struct hid_device *hid) struct bigben_device *bigben = hid_get_drvdata(hid);
cancel_work_sync(&bigben->worker); - hid_hw_close(hid); hid_hw_stop(hid); }
@@ -350,7 +349,7 @@ static int bigben_probe(struct hid_device *hid, error = input_ff_create_memless(hidinput->input, NULL, hid_bigben_play_effect); if (error) - return error; + goto error_hw_stop;
name_sz = strlen(dev_name(&hid->dev)) + strlen(":red:bigben#") + 1;
@@ -360,8 +359,10 @@ static int bigben_probe(struct hid_device *hid, sizeof(struct led_classdev) + name_sz, GFP_KERNEL ); - if (!led) - return -ENOMEM; + if (!led) { + error = -ENOMEM; + goto error_hw_stop; + } name = (void *)(&led[1]); snprintf(name, name_sz, "%s:red:bigben%d", @@ -375,7 +376,7 @@ static int bigben_probe(struct hid_device *hid, bigben->leds[n] = led; error = devm_led_classdev_register(&hid->dev, led); if (error) - return error; + goto error_hw_stop; }
/* initial state: LED1 is on, no rumble effect */ @@ -389,6 +390,10 @@ static int bigben_probe(struct hid_device *hid, hid_info(hid, "LED and force feedback support for BigBen gamepad\n");
return 0; + +error_hw_stop: + hid_hw_stop(hid); + return error; }
static __u8 *bigben_report_fixup(struct hid_device *hid, __u8 *rdesc,
From: Hanno Zulla kontakt@hanno.de
[ Upstream commit 4eb1b01de5b9d8596d6c103efcf1a15cfc1bedf7 ]
It's possible that there is scheduled work left while the device is already being removed, which can cause a kernel crash. Adding a flag will avoid this.
Signed-off-by: Hanno Zulla kontakt@hanno.de Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hid/hid-bigbenff.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/hid/hid-bigbenff.c b/drivers/hid/hid-bigbenff.c index f8c552b64a899..db6da21ade063 100644 --- a/drivers/hid/hid-bigbenff.c +++ b/drivers/hid/hid-bigbenff.c @@ -174,6 +174,7 @@ static __u8 pid0902_rdesc_fixed[] = { struct bigben_device { struct hid_device *hid; struct hid_report *report; + bool removed; u8 led_state; /* LED1 = 1 .. LED4 = 8 */ u8 right_motor_on; /* right motor off/on 0/1 */ u8 left_motor_force; /* left motor force 0-255 */ @@ -190,6 +191,9 @@ static void bigben_worker(struct work_struct *work) struct bigben_device, worker); struct hid_field *report_field = bigben->report->field[0];
+ if (bigben->removed) + return; + if (bigben->work_led) { bigben->work_led = false; report_field->value[0] = 0x01; /* 1 = led message */ @@ -304,6 +308,7 @@ static void bigben_remove(struct hid_device *hid) { struct bigben_device *bigben = hid_get_drvdata(hid);
+ bigben->removed = true; cancel_work_sync(&bigben->worker); hid_hw_stop(hid); } @@ -324,6 +329,7 @@ static int bigben_probe(struct hid_device *hid, return -ENOMEM; hid_set_drvdata(hid, bigben); bigben->hid = hid; + bigben->removed = false;
error = hid_parse(hid); if (error) {
From: Greentime Hu greentime.hu@sifive.com
[ Upstream commit c68a9032299e837b56d356de9250c93094f7e0e3 ]
When the kernel is running in S-mode, the expectation is that the bootloader or SBI layer will configure the PMP to allow the kernel to access physical memory. But, when the kernel is running in M-mode and is started with the ELF "loader", there's probably no bootloader or SBI layer involved to configure the PMP. Thus, we need to configure the PMP ourselves to enable the kernel to access all regions.
Signed-off-by: Greentime Hu greentime.hu@sifive.com Reviewed-by: Palmer Dabbelt palmerdabbelt@google.com Signed-off-by: Palmer Dabbelt palmerdabbelt@google.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/riscv/include/asm/csr.h | 12 ++++++++++++ arch/riscv/kernel/head.S | 6 ++++++ 2 files changed, 18 insertions(+)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h index 435b65532e294..8e18d2c64399d 100644 --- a/arch/riscv/include/asm/csr.h +++ b/arch/riscv/include/asm/csr.h @@ -72,6 +72,16 @@ #define EXC_LOAD_PAGE_FAULT 13 #define EXC_STORE_PAGE_FAULT 15
+/* PMP configuration */ +#define PMP_R 0x01 +#define PMP_W 0x02 +#define PMP_X 0x04 +#define PMP_A 0x18 +#define PMP_A_TOR 0x08 +#define PMP_A_NA4 0x10 +#define PMP_A_NAPOT 0x18 +#define PMP_L 0x80 + /* symbolic CSR names: */ #define CSR_CYCLE 0xc00 #define CSR_TIME 0xc01 @@ -100,6 +110,8 @@ #define CSR_MCAUSE 0x342 #define CSR_MTVAL 0x343 #define CSR_MIP 0x344 +#define CSR_PMPCFG0 0x3a0 +#define CSR_PMPADDR0 0x3b0 #define CSR_MHARTID 0xf14
#ifdef CONFIG_RISCV_M_MODE diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S index a4242be66966b..e4d9baf973232 100644 --- a/arch/riscv/kernel/head.S +++ b/arch/riscv/kernel/head.S @@ -58,6 +58,12 @@ _start_kernel: /* Reset all registers except ra, a0, a1 */ call reset_regs
+ /* Setup a PMP to permit access to all of memory. */ + li a0, -1 + csrw CSR_PMPADDR0, a0 + li a0, (PMP_A_NAPOT | PMP_R | PMP_W | PMP_X) + csrw CSR_PMPCFG0, a0 + /* * The hartid in a0 is expected later on, and we have no firmware * to hand it to us.
From: Anup Patel anup.patel@wdc.com
[ Upstream commit 6a1ce99dc4bde564e4a072936f9d41f4a439140e ]
Historically, we have been enabling all interrupts for each HART in trap_init(). Ideally, we should only enable M-mode interrupts for M-mode kernel and S-mode interrupts for S-mode kernel in trap_init().
Currently, we get suprious S-mode interrupts on Kendryte K210 board running M-mode NO-MMU kernel because we are enabling all interrupts in trap_init(). To fix this, we only enable software and external interrupt in trap_init(). In future, trap_init() will only enable software interrupt and PLIC driver will enable external interrupt using CPU notifiers.
Fixes: a4c3733d32a7 ("riscv: abstract out CSR names for supervisor vs machine mode") Signed-off-by: Anup Patel anup.patel@wdc.com Reviewed-by: Atish Patra atish.patra@wdc.com Tested-by: Palmer Dabbelt palmerdabbelt@google.com [QMEU virt machine with SMP] [Palmer: Move the Fixes up to a newer commit] Reviewed-by: Palmer Dabbelt palmerdabbelt@google.com Signed-off-by: Palmer Dabbelt palmerdabbelt@google.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/riscv/kernel/traps.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index f4cad5163bf2c..ffb3d94bf0cc2 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -156,6 +156,6 @@ void __init trap_init(void) csr_write(CSR_SCRATCH, 0); /* Set the exception vector address */ csr_write(CSR_TVEC, &handle_exception); - /* Enable all interrupts */ - csr_write(CSR_IE, -1); + /* Enable interrupts */ + csr_write(CSR_IE, IE_SIE | IE_EIE); }
From: Nathan Chancellor natechancellor@gmail.com
[ Upstream commit 72cf3b3df423c1bbd8fa1056fed009d3a260f8a9 ]
Clang does not support this option and errors out:
clang-11: error: unknown argument: '-mexplicit-relocs'
Clang does not appear to need this flag like GCC does because the jalr check that was added in commit 976c23af3ee5 ("mips: vdso: add build time check that no 'jalr t9' calls left") passes just fine with
$ make ARCH=mips CC=clang CROSS_COMPILE=mipsel-linux-gnu- malta_defconfig arch/mips/vdso/
even before commit d3f703c4359f ("mips: vdso: fix 'jalr t9' crash in vdso code").
-mrelax-pic-calls has been supported since clang 9, which is the earliest version that could build a working MIPS kernel, and it is the default for clang so just leave it be.
Fixes: d3f703c4359f ("mips: vdso: fix 'jalr t9' crash in vdso code") Link: https://github.com/ClangBuiltLinux/linux/issues/890 Signed-off-by: Nathan Chancellor natechancellor@gmail.com Reviewed-by: Nick Desaulniers ndesaulniers@google.com Tested-by: Nick Desaulniers ndesaulniers@google.com Signed-off-by: Paul Burton paulburton@kernel.org Cc: Ralf Baechle ralf@linux-mips.org Cc: linux-mips@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: clang-built-linux@googlegroups.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/mips/vdso/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/mips/vdso/Makefile b/arch/mips/vdso/Makefile index bfb65b2d57c7f..2cf4b6131d88d 100644 --- a/arch/mips/vdso/Makefile +++ b/arch/mips/vdso/Makefile @@ -29,7 +29,7 @@ endif cflags-vdso := $(ccflags-vdso) \ $(filter -W%,$(filter-out -Wa$(comma)%,$(KBUILD_CFLAGS))) \ -O3 -g -fPIC -fno-strict-aliasing -fno-common -fno-builtin -G 0 \ - -mrelax-pic-calls -mexplicit-relocs \ + -mrelax-pic-calls $(call cc-option, -mexplicit-relocs) \ -fno-stack-protector -fno-jump-tables -DDISABLE_BRANCH_PROFILING \ $(call cc-option, -fno-asynchronous-unwind-tables) \ $(call cc-option, -fno-stack-protector)
From: Heidi Fahim heidifahim@google.com
[ Upstream commit be886ba90cce2fb2f5a4dbcda8f3be3fd1b2f484 ]
Implemented small fix so that the script changes work directories to the root of the linux kernel source tree from which kunit.py is run. This enables the user to run kunit from any working directory. Originally considered using os.path.join but this is more error prone as we would have to find all file path usages and modify them accordingly. Using os.chdir ensures that the entire script is run within /linux.
Signed-off-by: Heidi Fahim heidifahim@google.com Reviewed-by: Brendan Higgins brendanhiggins@google.com Signed-off-by: Shuah Khan skhan@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- tools/testing/kunit/kunit.py | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/tools/testing/kunit/kunit.py b/tools/testing/kunit/kunit.py index e59eb9e7f9236..180ad1e1b04f9 100755 --- a/tools/testing/kunit/kunit.py +++ b/tools/testing/kunit/kunit.py @@ -24,6 +24,8 @@ KunitResult = namedtuple('KunitResult', ['status','result'])
KunitRequest = namedtuple('KunitRequest', ['raw_output','timeout', 'jobs', 'build_dir', 'defconfig'])
+KernelDirectoryPath = sys.argv[0].split('tools/testing/kunit/')[0] + class KunitStatus(Enum): SUCCESS = auto() CONFIG_FAILURE = auto() @@ -35,6 +37,13 @@ def create_default_kunitconfig(): shutil.copyfile('arch/um/configs/kunit_defconfig', kunit_kernel.kunitconfig_path)
+def get_kernel_root_path(): + parts = sys.argv[0] if not __file__ else __file__ + parts = os.path.realpath(parts).split('tools/testing/kunit') + if len(parts) != 2: + sys.exit(1) + return parts[0] + def run_tests(linux: kunit_kernel.LinuxSourceTree, request: KunitRequest) -> KunitResult: config_start = time.time() @@ -114,6 +123,9 @@ def main(argv, linux=None): cli_args = parser.parse_args(argv)
if cli_args.subcommand == 'run': + if get_kernel_root_path(): + os.chdir(get_kernel_root_path()) + if cli_args.build_dir: if not os.path.exists(cli_args.build_dir): os.mkdir(cli_args.build_dir)
From: Michael Ellerman mpe@ellerman.id.au
[ Upstream commit ef89d0545132d685f73da6f58b7e7fe002536f91 ]
Currently if you build with O=... the rseq tests don't build:
$ make O=$PWD/output -C tools/testing/selftests/ TARGETS=rseq make: Entering directory '/linux/tools/testing/selftests' ... make[1]: Entering directory '/linux/tools/testing/selftests/rseq' gcc -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./ -shared -fPIC rseq.c -lpthread -o /linux/output/rseq/librseq.so gcc -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./ basic_test.c -lpthread -lrseq -o /linux/output/rseq/basic_test /usr/bin/ld: cannot find -lrseq collect2: error: ld returned 1 exit status
This is because the library search path points to the source directory, not the output.
We can fix it by changing the library search path to $(OUTPUT).
Signed-off-by: Michael Ellerman mpe@ellerman.id.au Signed-off-by: Shuah Khan skhan@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- tools/testing/selftests/rseq/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile index d6469535630af..708c1b3452454 100644 --- a/tools/testing/selftests/rseq/Makefile +++ b/tools/testing/selftests/rseq/Makefile @@ -4,7 +4,7 @@ ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),) CLANG_FLAGS += -no-integrated-as endif
-CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./ \ +CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L$(OUTPUT) -Wl,-rpath=./ \ $(CLANG_FLAGS) LDLIBS += -lpthread
From: Tom Zanussi zanussi@kernel.org
[ Upstream commit 784bd0847eda032ed2f3522f87250655a18c0190 ]
Fix a varargs-related bug in print_synth_event() which resulted in strange output and oopses on 32-bit x86 systems. The problem is that trace_seq_printf() expects the varargs to match the format string, but print_synth_event() was always passing u64 values regardless. This results in unspecified behavior when unpacking with va_arg() in trace_seq_printf().
Add a function that takes the size into account when calling trace_seq_printf().
Before:
modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null)
After:
insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598
Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155...
Reported-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Tom Zanussi zanussi@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/trace/trace_events_hist.c | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index e10585ef00e15..862fb6d16edb8 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -811,6 +811,29 @@ static const char *synth_field_fmt(char *type) return fmt; }
+static void print_synth_event_num_val(struct trace_seq *s, + char *print_fmt, char *name, + int size, u64 val, char *space) +{ + switch (size) { + case 1: + trace_seq_printf(s, print_fmt, name, (u8)val, space); + break; + + case 2: + trace_seq_printf(s, print_fmt, name, (u16)val, space); + break; + + case 4: + trace_seq_printf(s, print_fmt, name, (u32)val, space); + break; + + default: + trace_seq_printf(s, print_fmt, name, val, space); + break; + } +} + static enum print_line_t print_synth_event(struct trace_iterator *iter, int flags, struct trace_event *event) @@ -849,10 +872,13 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter, } else { struct trace_print_flags __flags[] = { __def_gfpflag_names, {-1, NULL} }; + char *space = (i == se->n_fields - 1 ? "" : " ");
- trace_seq_printf(s, print_fmt, se->fields[i]->name, - entry->fields[n_u64], - i == se->n_fields - 1 ? "" : " "); + print_synth_event_num_val(s, print_fmt, + se->fields[i]->name, + se->fields[i]->size, + entry->fields[n_u64], + space);
if (strcmp(se->fields[i]->type, "gfp_t") == 0) { trace_seq_puts(s, " (");
From: Johannes Berg johannes.berg@intel.com
[ Upstream commit 9951ebfcdf2b97dbb28a5d930458424341e61aa2 ]
If nl80211_parse_he_obss_pd() fails, we leak the previously allocated ACL memory. Free it in this case.
Fixes: 796e90f42b7e ("cfg80211: add support for parsing OBBS_PD attributes") Signed-off-by: Johannes Berg johannes.berg@intel.com Link: https://lore.kernel.org/r/20200221104142.835aba4cdd14.I1923b55ba9989c57e1397... Signed-off-by: Johannes Berg johannes.berg@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/wireless/nl80211.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c index 1e97ac5435b23..6032f1cce9416 100644 --- a/net/wireless/nl80211.c +++ b/net/wireless/nl80211.c @@ -4799,8 +4799,7 @@ static int nl80211_start_ap(struct sk_buff *skb, struct genl_info *info) err = nl80211_parse_he_obss_pd( info->attrs[NL80211_ATTR_HE_OBSS_PD], ¶ms.he_obss_pd); - if (err) - return err; + goto out; }
nl80211_calculate_ap_params(¶ms); @@ -4822,6 +4821,7 @@ static int nl80211_start_ap(struct sk_buff *skb, struct genl_info *info) } wdev_unlock(wdev);
+out: kfree(params.acl);
return err;
From: Johannes Berg johannes.berg@intel.com
[ Upstream commit a7ee7d44b57c9ae174088e53a668852b7f4f452d ]
We may end up with a NULL reg_rule after the loop in handle_channel_custom() if the bandwidth didn't fit, check if this is the case and bail out if so.
Signed-off-by: Johannes Berg johannes.berg@intel.com Link: https://lore.kernel.org/r/20200221104449.3b558a50201c.I4ad3725c4dacaefd2d18d... Signed-off-by: Johannes Berg johannes.berg@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/wireless/reg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/wireless/reg.c b/net/wireless/reg.c index fff9a74891fc4..1a8218f1bbe07 100644 --- a/net/wireless/reg.c +++ b/net/wireless/reg.c @@ -2276,7 +2276,7 @@ static void handle_channel_custom(struct wiphy *wiphy, break; }
- if (IS_ERR(reg_rule)) { + if (IS_ERR_OR_NULL(reg_rule)) { pr_debug("Disabling freq %d MHz as custom regd has no rule that fits it\n", chan->center_freq); if (wiphy->regulatory_flags & REGULATORY_WIPHY_SELF_MANAGED) {
From: Andrei Otcheretianski andrei.otcheretianski@intel.com
[ Upstream commit 0daa63ed4c6c4302790ce67b7a90c0997ceb7514 ]
The below-mentioned commit changed the code to unlock *inside* the function, but previously the unlock was *outside*. It failed to remove the outer unlock, however, leading to double unlock.
Fix this.
Fixes: 33483a6b88e4 ("mac80211: fix missing unlock on error in ieee80211_mark_sta_auth()") Signed-off-by: Andrei Otcheretianski andrei.otcheretianski@intel.com Link: https://lore.kernel.org/r/20200221104719.cce4741cf6eb.I671567b185c8a4c240937... [rewrite commit message to better explain what happened] Signed-off-by: Johannes Berg johannes.berg@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/mac80211/mlme.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c index e041af2f021ad..88d7a692a9658 100644 --- a/net/mac80211/mlme.c +++ b/net/mac80211/mlme.c @@ -2959,7 +2959,7 @@ static void ieee80211_rx_mgmt_auth(struct ieee80211_sub_if_data *sdata, (auth_transaction == 2 && ifmgd->auth_data->expected_transaction == 2)) { if (!ieee80211_mark_sta_auth(sdata, bssid)) - goto out_err; + return; /* ignore frame -- wait for timeout */ } else if (ifmgd->auth_data->algorithm == WLAN_AUTH_SAE && auth_transaction == 2) { sdata_info(sdata, "SAE peer confirmed\n"); @@ -2967,10 +2967,6 @@ static void ieee80211_rx_mgmt_auth(struct ieee80211_sub_if_data *sdata, }
cfg80211_rx_mlme_mgmt(sdata->dev, (u8 *)mgmt, len); - return; - out_err: - mutex_unlock(&sdata->local->sta_mtx); - /* ignore frame -- wait for timeout */ }
#define case_WLAN(type) \
From: Igor Druzhinin igor.druzhinin@citrix.com
[ Upstream commit ff6993bb79b9f99bdac0b5378169052931b65432 ]
fc_disc_gpn_id_resp() should be the last function using it so free it here to avoid memory leak.
Link: https://lore.kernel.org/r/1579013000-14570-2-git-send-email-igor.druzhinin@c... Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Igor Druzhinin igor.druzhinin@citrix.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/scsi/libfc/fc_disc.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/libfc/fc_disc.c b/drivers/scsi/libfc/fc_disc.c index 9c5f7c9178c66..2b865c6423e29 100644 --- a/drivers/scsi/libfc/fc_disc.c +++ b/drivers/scsi/libfc/fc_disc.c @@ -628,6 +628,8 @@ static void fc_disc_gpn_id_resp(struct fc_seq *sp, struct fc_frame *fp, } out: kref_put(&rdata->kref, fc_rport_destroy); + if (!IS_ERR(fp)) + fc_frame_free(fp); }
/**
From: Jozsef Kadlecsik kadlec@netfilter.org
[ Upstream commit f66ee0410b1c3481ee75e5db9b34547b4d582465 ]
In the case of huge hash:* types of sets, due to the single spinlock of a set the processing of the whole set under spinlock protection could take too long.
There were four places where the whole hash table of the set was processed from bucket to bucket under holding the spinlock:
- During resizing a set, the original set was locked to exclude kernel side add/del element operations (userspace add/del is excluded by the nfnetlink mutex). The original set is actually just read during the resize, so the spinlocking is replaced with rcu locking of regions. However, thus there can be parallel kernel side add/del of entries. In order not to loose those operations a backlog is added and replayed after the successful resize. - Garbage collection of timed out entries was also protected by the spinlock. In order not to lock too long, region locking is introduced and a single region is processed in one gc go. Also, the simple timer based gc running is replaced with a workqueue based solution. The internal book-keeping (number of elements, size of extensions) is moved to region level due to the region locking. - Adding elements: when the max number of the elements is reached, the gc was called to evict the timed out entries. The new approach is that the gc is called just for the matching region, assuming that if the region (proportionally) seems to be full, then the whole set does. We could scan the other regions to check every entry under rcu locking, but for huge sets it'd mean a slowdown at adding elements. - Listing the set header data: when the set was defined with timeout support, the garbage collector was called to clean up timed out entries to get the correct element numbers and set size values. Now the set is scanned to check non-timed out entries, without actually calling the gc for the whole set.
Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe -> SOFTIRQ-unsafe lock order issues during working on the patch.
Reported-by: syzbot+4b0e9d4ff3cf117837e5@syzkaller.appspotmail.com Reported-by: syzbot+c27b8d5010f45c666ed1@syzkaller.appspotmail.com Reported-by: syzbot+68a806795ac89df3aa1c@syzkaller.appspotmail.com Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and protocol version 7") Signed-off-by: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/netfilter/ipset/ip_set.h | 11 +- net/netfilter/ipset/ip_set_core.c | 34 +- net/netfilter/ipset/ip_set_hash_gen.h | 633 +++++++++++++++++-------- 3 files changed, 472 insertions(+), 206 deletions(-)
diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h index 908d38dbcb91f..5448c8b443dbf 100644 --- a/include/linux/netfilter/ipset/ip_set.h +++ b/include/linux/netfilter/ipset/ip_set.h @@ -121,6 +121,7 @@ struct ip_set_ext { u32 timeout; u8 packets_op; u8 bytes_op; + bool target; };
struct ip_set; @@ -187,6 +188,14 @@ struct ip_set_type_variant { /* Return true if "b" set is the same as "a" * according to the create set parameters */ bool (*same_set)(const struct ip_set *a, const struct ip_set *b); + /* Region-locking is used */ + bool region_lock; +}; + +struct ip_set_region { + spinlock_t lock; /* Region lock */ + size_t ext_size; /* Size of the dynamic extensions */ + u32 elements; /* Number of elements vs timeout */ };
/* The core set type structure */ @@ -501,7 +510,7 @@ ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo, }
#define IP_SET_INIT_KEXT(skb, opt, set) \ - { .bytes = (skb)->len, .packets = 1, \ + { .bytes = (skb)->len, .packets = 1, .target = true,\ .timeout = ip_set_adt_opt_timeout(opt, set) }
#define IP_SET_INIT_UEXT(set) \ diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index 69c107f9ba8db..8dd17589217d7 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -723,6 +723,20 @@ ip_set_rcu_get(struct net *net, ip_set_id_t index) return set; }
+static inline void +ip_set_lock(struct ip_set *set) +{ + if (!set->variant->region_lock) + spin_lock_bh(&set->lock); +} + +static inline void +ip_set_unlock(struct ip_set *set) +{ + if (!set->variant->region_lock) + spin_unlock_bh(&set->lock); +} + int ip_set_test(ip_set_id_t index, const struct sk_buff *skb, const struct xt_action_param *par, struct ip_set_adt_opt *opt) @@ -744,9 +758,9 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb, if (ret == -EAGAIN) { /* Type requests element to be completed */ pr_debug("element must be completed, ADD is triggered\n"); - spin_lock_bh(&set->lock); + ip_set_lock(set); set->variant->kadt(set, skb, par, IPSET_ADD, opt); - spin_unlock_bh(&set->lock); + ip_set_unlock(set); ret = 1; } else { /* --return-nomatch: invert matched element */ @@ -775,9 +789,9 @@ ip_set_add(ip_set_id_t index, const struct sk_buff *skb, !(opt->family == set->family || set->family == NFPROTO_UNSPEC)) return -IPSET_ERR_TYPE_MISMATCH;
- spin_lock_bh(&set->lock); + ip_set_lock(set); ret = set->variant->kadt(set, skb, par, IPSET_ADD, opt); - spin_unlock_bh(&set->lock); + ip_set_unlock(set);
return ret; } @@ -797,9 +811,9 @@ ip_set_del(ip_set_id_t index, const struct sk_buff *skb, !(opt->family == set->family || set->family == NFPROTO_UNSPEC)) return -IPSET_ERR_TYPE_MISMATCH;
- spin_lock_bh(&set->lock); + ip_set_lock(set); ret = set->variant->kadt(set, skb, par, IPSET_DEL, opt); - spin_unlock_bh(&set->lock); + ip_set_unlock(set);
return ret; } @@ -1264,9 +1278,9 @@ ip_set_flush_set(struct ip_set *set) { pr_debug("set: %s\n", set->name);
- spin_lock_bh(&set->lock); + ip_set_lock(set); set->variant->flush(set); - spin_unlock_bh(&set->lock); + ip_set_unlock(set); }
static int ip_set_flush(struct net *net, struct sock *ctnl, struct sk_buff *skb, @@ -1713,9 +1727,9 @@ call_ad(struct sock *ctnl, struct sk_buff *skb, struct ip_set *set, bool eexist = flags & IPSET_FLAG_EXIST, retried = false;
do { - spin_lock_bh(&set->lock); + ip_set_lock(set); ret = set->variant->uadt(set, tb, adt, &lineno, flags, retried); - spin_unlock_bh(&set->lock); + ip_set_unlock(set); retried = true; } while (ret == -EAGAIN && set->variant->resize && diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h index 7480ce55b5c85..71e93eac08319 100644 --- a/net/netfilter/ipset/ip_set_hash_gen.h +++ b/net/netfilter/ipset/ip_set_hash_gen.h @@ -7,13 +7,21 @@ #include <linux/rcupdate.h> #include <linux/jhash.h> #include <linux/types.h> +#include <linux/netfilter/nfnetlink.h> #include <linux/netfilter/ipset/ip_set.h>
-#define __ipset_dereference_protected(p, c) rcu_dereference_protected(p, c) -#define ipset_dereference_protected(p, set) \ - __ipset_dereference_protected(p, lockdep_is_held(&(set)->lock)) - -#define rcu_dereference_bh_nfnl(p) rcu_dereference_bh_check(p, 1) +#define __ipset_dereference(p) \ + rcu_dereference_protected(p, 1) +#define ipset_dereference_nfnl(p) \ + rcu_dereference_protected(p, \ + lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET)) +#define ipset_dereference_set(p, set) \ + rcu_dereference_protected(p, \ + lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET) || \ + lockdep_is_held(&(set)->lock)) +#define ipset_dereference_bh_nfnl(p) \ + rcu_dereference_bh_check(p, \ + lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET))
/* Hashing which uses arrays to resolve clashing. The hash table is resized * (doubled) when searching becomes too long. @@ -72,11 +80,35 @@ struct hbucket { __aligned(__alignof__(u64)); };
+/* Region size for locking == 2^HTABLE_REGION_BITS */ +#define HTABLE_REGION_BITS 10 +#define ahash_numof_locks(htable_bits) \ + ((htable_bits) < HTABLE_REGION_BITS ? 1 \ + : jhash_size((htable_bits) - HTABLE_REGION_BITS)) +#define ahash_sizeof_regions(htable_bits) \ + (ahash_numof_locks(htable_bits) * sizeof(struct ip_set_region)) +#define ahash_region(n, htable_bits) \ + ((n) % ahash_numof_locks(htable_bits)) +#define ahash_bucket_start(h, htable_bits) \ + ((htable_bits) < HTABLE_REGION_BITS ? 0 \ + : (h) * jhash_size(HTABLE_REGION_BITS)) +#define ahash_bucket_end(h, htable_bits) \ + ((htable_bits) < HTABLE_REGION_BITS ? jhash_size(htable_bits) \ + : ((h) + 1) * jhash_size(HTABLE_REGION_BITS)) + +struct htable_gc { + struct delayed_work dwork; + struct ip_set *set; /* Set the gc belongs to */ + u32 region; /* Last gc run position */ +}; + /* The hash table: the table size stored here in order to make resizing easy */ struct htable { atomic_t ref; /* References for resizing */ - atomic_t uref; /* References for dumping */ + atomic_t uref; /* References for dumping and gc */ u8 htable_bits; /* size of hash table == 2^htable_bits */ + u32 maxelem; /* Maxelem per region */ + struct ip_set_region *hregion; /* Region locks and ext sizes */ struct hbucket __rcu *bucket[0]; /* hashtable buckets */ };
@@ -162,6 +194,10 @@ htable_bits(u32 hashsize) #define NLEN 0 #endif /* IP_SET_HASH_WITH_NETS */
+#define SET_ELEM_EXPIRED(set, d) \ + (SET_WITH_TIMEOUT(set) && \ + ip_set_timeout_expired(ext_timeout(d, set))) + #endif /* _IP_SET_HASH_GEN_H */
#ifndef MTYPE @@ -205,10 +241,12 @@ htable_bits(u32 hashsize) #undef mtype_test_cidrs #undef mtype_test #undef mtype_uref -#undef mtype_expire #undef mtype_resize +#undef mtype_ext_size +#undef mtype_resize_ad #undef mtype_head #undef mtype_list +#undef mtype_gc_do #undef mtype_gc #undef mtype_gc_init #undef mtype_variant @@ -247,10 +285,12 @@ htable_bits(u32 hashsize) #define mtype_test_cidrs IPSET_TOKEN(MTYPE, _test_cidrs) #define mtype_test IPSET_TOKEN(MTYPE, _test) #define mtype_uref IPSET_TOKEN(MTYPE, _uref) -#define mtype_expire IPSET_TOKEN(MTYPE, _expire) #define mtype_resize IPSET_TOKEN(MTYPE, _resize) +#define mtype_ext_size IPSET_TOKEN(MTYPE, _ext_size) +#define mtype_resize_ad IPSET_TOKEN(MTYPE, _resize_ad) #define mtype_head IPSET_TOKEN(MTYPE, _head) #define mtype_list IPSET_TOKEN(MTYPE, _list) +#define mtype_gc_do IPSET_TOKEN(MTYPE, _gc_do) #define mtype_gc IPSET_TOKEN(MTYPE, _gc) #define mtype_gc_init IPSET_TOKEN(MTYPE, _gc_init) #define mtype_variant IPSET_TOKEN(MTYPE, _variant) @@ -275,8 +315,7 @@ htable_bits(u32 hashsize) /* The generic hash structure */ struct htype { struct htable __rcu *table; /* the hash table */ - struct timer_list gc; /* garbage collection when timeout enabled */ - struct ip_set *set; /* attached to this ip_set */ + struct htable_gc gc; /* gc workqueue */ u32 maxelem; /* max elements in the hash */ u32 initval; /* random jhash init value */ #ifdef IP_SET_HASH_WITH_MARKMASK @@ -288,21 +327,33 @@ struct htype { #ifdef IP_SET_HASH_WITH_NETMASK u8 netmask; /* netmask value for subnets to store */ #endif + struct list_head ad; /* Resize add|del backlist */ struct mtype_elem next; /* temporary storage for uadd */ #ifdef IP_SET_HASH_WITH_NETS struct net_prefixes nets[NLEN]; /* book-keeping of prefixes */ #endif };
+/* ADD|DEL entries saved during resize */ +struct mtype_resize_ad { + struct list_head list; + enum ipset_adt ad; /* ADD|DEL element */ + struct mtype_elem d; /* Element value */ + struct ip_set_ext ext; /* Extensions for ADD */ + struct ip_set_ext mext; /* Target extensions for ADD */ + u32 flags; /* Flags for ADD */ +}; + #ifdef IP_SET_HASH_WITH_NETS /* Network cidr size book keeping when the hash stores different * sized networks. cidr == real cidr + 1 to support /0. */ static void -mtype_add_cidr(struct htype *h, u8 cidr, u8 n) +mtype_add_cidr(struct ip_set *set, struct htype *h, u8 cidr, u8 n) { int i, j;
+ spin_lock_bh(&set->lock); /* Add in increasing prefix order, so larger cidr first */ for (i = 0, j = -1; i < NLEN && h->nets[i].cidr[n]; i++) { if (j != -1) { @@ -311,7 +362,7 @@ mtype_add_cidr(struct htype *h, u8 cidr, u8 n) j = i; } else if (h->nets[i].cidr[n] == cidr) { h->nets[CIDR_POS(cidr)].nets[n]++; - return; + goto unlock; } } if (j != -1) { @@ -320,24 +371,29 @@ mtype_add_cidr(struct htype *h, u8 cidr, u8 n) } h->nets[i].cidr[n] = cidr; h->nets[CIDR_POS(cidr)].nets[n] = 1; +unlock: + spin_unlock_bh(&set->lock); }
static void -mtype_del_cidr(struct htype *h, u8 cidr, u8 n) +mtype_del_cidr(struct ip_set *set, struct htype *h, u8 cidr, u8 n) { u8 i, j, net_end = NLEN - 1;
+ spin_lock_bh(&set->lock); for (i = 0; i < NLEN; i++) { if (h->nets[i].cidr[n] != cidr) continue; h->nets[CIDR_POS(cidr)].nets[n]--; if (h->nets[CIDR_POS(cidr)].nets[n] > 0) - return; + goto unlock; for (j = i; j < net_end && h->nets[j].cidr[n]; j++) h->nets[j].cidr[n] = h->nets[j + 1].cidr[n]; h->nets[j].cidr[n] = 0; - return; + goto unlock; } +unlock: + spin_unlock_bh(&set->lock); } #endif
@@ -345,7 +401,7 @@ mtype_del_cidr(struct htype *h, u8 cidr, u8 n) static size_t mtype_ahash_memsize(const struct htype *h, const struct htable *t) { - return sizeof(*h) + sizeof(*t); + return sizeof(*h) + sizeof(*t) + ahash_sizeof_regions(t->htable_bits); }
/* Get the ith element from the array block n */ @@ -369,24 +425,29 @@ mtype_flush(struct ip_set *set) struct htype *h = set->data; struct htable *t; struct hbucket *n; - u32 i; - - t = ipset_dereference_protected(h->table, set); - for (i = 0; i < jhash_size(t->htable_bits); i++) { - n = __ipset_dereference_protected(hbucket(t, i), 1); - if (!n) - continue; - if (set->extensions & IPSET_EXT_DESTROY) - mtype_ext_cleanup(set, n); - /* FIXME: use slab cache */ - rcu_assign_pointer(hbucket(t, i), NULL); - kfree_rcu(n, rcu); + u32 r, i; + + t = ipset_dereference_nfnl(h->table); + for (r = 0; r < ahash_numof_locks(t->htable_bits); r++) { + spin_lock_bh(&t->hregion[r].lock); + for (i = ahash_bucket_start(r, t->htable_bits); + i < ahash_bucket_end(r, t->htable_bits); i++) { + n = __ipset_dereference(hbucket(t, i)); + if (!n) + continue; + if (set->extensions & IPSET_EXT_DESTROY) + mtype_ext_cleanup(set, n); + /* FIXME: use slab cache */ + rcu_assign_pointer(hbucket(t, i), NULL); + kfree_rcu(n, rcu); + } + t->hregion[r].ext_size = 0; + t->hregion[r].elements = 0; + spin_unlock_bh(&t->hregion[r].lock); } #ifdef IP_SET_HASH_WITH_NETS memset(h->nets, 0, sizeof(h->nets)); #endif - set->elements = 0; - set->ext_size = 0; }
/* Destroy the hashtable part of the set */ @@ -397,7 +458,7 @@ mtype_ahash_destroy(struct ip_set *set, struct htable *t, bool ext_destroy) u32 i;
for (i = 0; i < jhash_size(t->htable_bits); i++) { - n = __ipset_dereference_protected(hbucket(t, i), 1); + n = __ipset_dereference(hbucket(t, i)); if (!n) continue; if (set->extensions & IPSET_EXT_DESTROY && ext_destroy) @@ -406,6 +467,7 @@ mtype_ahash_destroy(struct ip_set *set, struct htable *t, bool ext_destroy) kfree(n); }
+ ip_set_free(t->hregion); ip_set_free(t); }
@@ -414,28 +476,21 @@ static void mtype_destroy(struct ip_set *set) { struct htype *h = set->data; + struct list_head *l, *lt;
if (SET_WITH_TIMEOUT(set)) - del_timer_sync(&h->gc); + cancel_delayed_work_sync(&h->gc.dwork);
- mtype_ahash_destroy(set, - __ipset_dereference_protected(h->table, 1), true); + mtype_ahash_destroy(set, ipset_dereference_nfnl(h->table), true); + list_for_each_safe(l, lt, &h->ad) { + list_del(l); + kfree(l); + } kfree(h);
set->data = NULL; }
-static void -mtype_gc_init(struct ip_set *set, void (*gc)(struct timer_list *t)) -{ - struct htype *h = set->data; - - timer_setup(&h->gc, gc, 0); - mod_timer(&h->gc, jiffies + IPSET_GC_PERIOD(set->timeout) * HZ); - pr_debug("gc initialized, run in every %u\n", - IPSET_GC_PERIOD(set->timeout)); -} - static bool mtype_same_set(const struct ip_set *a, const struct ip_set *b) { @@ -454,11 +509,9 @@ mtype_same_set(const struct ip_set *a, const struct ip_set *b) a->extensions == b->extensions; }
-/* Delete expired elements from the hashtable */ static void -mtype_expire(struct ip_set *set, struct htype *h) +mtype_gc_do(struct ip_set *set, struct htype *h, struct htable *t, u32 r) { - struct htable *t; struct hbucket *n, *tmp; struct mtype_elem *data; u32 i, j, d; @@ -466,10 +519,12 @@ mtype_expire(struct ip_set *set, struct htype *h) #ifdef IP_SET_HASH_WITH_NETS u8 k; #endif + u8 htable_bits = t->htable_bits;
- t = ipset_dereference_protected(h->table, set); - for (i = 0; i < jhash_size(t->htable_bits); i++) { - n = __ipset_dereference_protected(hbucket(t, i), 1); + spin_lock_bh(&t->hregion[r].lock); + for (i = ahash_bucket_start(r, htable_bits); + i < ahash_bucket_end(r, htable_bits); i++) { + n = __ipset_dereference(hbucket(t, i)); if (!n) continue; for (j = 0, d = 0; j < n->pos; j++) { @@ -485,58 +540,100 @@ mtype_expire(struct ip_set *set, struct htype *h) smp_mb__after_atomic(); #ifdef IP_SET_HASH_WITH_NETS for (k = 0; k < IPSET_NET_COUNT; k++) - mtype_del_cidr(h, + mtype_del_cidr(set, h, NCIDR_PUT(DCIDR_GET(data->cidr, k)), k); #endif + t->hregion[r].elements--; ip_set_ext_destroy(set, data); - set->elements--; d++; } if (d >= AHASH_INIT_SIZE) { if (d >= n->size) { + t->hregion[r].ext_size -= + ext_size(n->size, dsize); rcu_assign_pointer(hbucket(t, i), NULL); kfree_rcu(n, rcu); continue; } tmp = kzalloc(sizeof(*tmp) + - (n->size - AHASH_INIT_SIZE) * dsize, - GFP_ATOMIC); + (n->size - AHASH_INIT_SIZE) * dsize, + GFP_ATOMIC); if (!tmp) - /* Still try to delete expired elements */ + /* Still try to delete expired elements. */ continue; tmp->size = n->size - AHASH_INIT_SIZE; for (j = 0, d = 0; j < n->pos; j++) { if (!test_bit(j, n->used)) continue; data = ahash_data(n, j, dsize); - memcpy(tmp->value + d * dsize, data, dsize); + memcpy(tmp->value + d * dsize, + data, dsize); set_bit(d, tmp->used); d++; } tmp->pos = d; - set->ext_size -= ext_size(AHASH_INIT_SIZE, dsize); + t->hregion[r].ext_size -= + ext_size(AHASH_INIT_SIZE, dsize); rcu_assign_pointer(hbucket(t, i), tmp); kfree_rcu(n, rcu); } } + spin_unlock_bh(&t->hregion[r].lock); }
static void -mtype_gc(struct timer_list *t) +mtype_gc(struct work_struct *work) { - struct htype *h = from_timer(h, t, gc); - struct ip_set *set = h->set; + struct htable_gc *gc; + struct ip_set *set; + struct htype *h; + struct htable *t; + u32 r, numof_locks; + unsigned int next_run; + + gc = container_of(work, struct htable_gc, dwork.work); + set = gc->set; + h = set->data;
- pr_debug("called\n"); spin_lock_bh(&set->lock); - mtype_expire(set, h); + t = ipset_dereference_set(h->table, set); + atomic_inc(&t->uref); + numof_locks = ahash_numof_locks(t->htable_bits); + r = gc->region++; + if (r >= numof_locks) { + r = gc->region = 0; + } + next_run = (IPSET_GC_PERIOD(set->timeout) * HZ) / numof_locks; + if (next_run < HZ/10) + next_run = HZ/10; spin_unlock_bh(&set->lock);
- h->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ; - add_timer(&h->gc); + mtype_gc_do(set, h, t, r); + + if (atomic_dec_and_test(&t->uref) && atomic_read(&t->ref)) { + pr_debug("Table destroy after resize by expire: %p\n", t); + mtype_ahash_destroy(set, t, false); + } + + queue_delayed_work(system_power_efficient_wq, &gc->dwork, next_run); + +} + +static void +mtype_gc_init(struct htable_gc *gc) +{ + INIT_DEFERRABLE_WORK(&gc->dwork, mtype_gc); + queue_delayed_work(system_power_efficient_wq, &gc->dwork, HZ); }
+static int +mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, + struct ip_set_ext *mext, u32 flags); +static int +mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext, + struct ip_set_ext *mext, u32 flags); + /* Resize a hash: create a new hash table with doubling the hashsize * and inserting the elements to it. Repeat until we succeed or * fail due to memory pressures. @@ -547,7 +644,7 @@ mtype_resize(struct ip_set *set, bool retried) struct htype *h = set->data; struct htable *t, *orig; u8 htable_bits; - size_t extsize, dsize = set->dsize; + size_t dsize = set->dsize; #ifdef IP_SET_HASH_WITH_NETS u8 flags; struct mtype_elem *tmp; @@ -555,7 +652,9 @@ mtype_resize(struct ip_set *set, bool retried) struct mtype_elem *data; struct mtype_elem *d; struct hbucket *n, *m; - u32 i, j, key; + struct list_head *l, *lt; + struct mtype_resize_ad *x; + u32 i, j, r, nr, key; int ret;
#ifdef IP_SET_HASH_WITH_NETS @@ -563,10 +662,8 @@ mtype_resize(struct ip_set *set, bool retried) if (!tmp) return -ENOMEM; #endif - rcu_read_lock_bh(); - orig = rcu_dereference_bh_nfnl(h->table); + orig = ipset_dereference_bh_nfnl(h->table); htable_bits = orig->htable_bits; - rcu_read_unlock_bh();
retry: ret = 0; @@ -583,88 +680,124 @@ mtype_resize(struct ip_set *set, bool retried) ret = -ENOMEM; goto out; } + t->hregion = ip_set_alloc(ahash_sizeof_regions(htable_bits)); + if (!t->hregion) { + kfree(t); + ret = -ENOMEM; + goto out; + } t->htable_bits = htable_bits; + t->maxelem = h->maxelem / ahash_numof_locks(htable_bits); + for (i = 0; i < ahash_numof_locks(htable_bits); i++) + spin_lock_init(&t->hregion[i].lock);
- spin_lock_bh(&set->lock); - orig = __ipset_dereference_protected(h->table, 1); - /* There can't be another parallel resizing, but dumping is possible */ + /* There can't be another parallel resizing, + * but dumping, gc, kernel side add/del are possible + */ + orig = ipset_dereference_bh_nfnl(h->table); atomic_set(&orig->ref, 1); atomic_inc(&orig->uref); - extsize = 0; pr_debug("attempt to resize set %s from %u to %u, t %p\n", set->name, orig->htable_bits, htable_bits, orig); - for (i = 0; i < jhash_size(orig->htable_bits); i++) { - n = __ipset_dereference_protected(hbucket(orig, i), 1); - if (!n) - continue; - for (j = 0; j < n->pos; j++) { - if (!test_bit(j, n->used)) + for (r = 0; r < ahash_numof_locks(orig->htable_bits); r++) { + /* Expire may replace a hbucket with another one */ + rcu_read_lock_bh(); + for (i = ahash_bucket_start(r, orig->htable_bits); + i < ahash_bucket_end(r, orig->htable_bits); i++) { + n = __ipset_dereference(hbucket(orig, i)); + if (!n) continue; - data = ahash_data(n, j, dsize); + for (j = 0; j < n->pos; j++) { + if (!test_bit(j, n->used)) + continue; + data = ahash_data(n, j, dsize); + if (SET_ELEM_EXPIRED(set, data)) + continue; #ifdef IP_SET_HASH_WITH_NETS - /* We have readers running parallel with us, - * so the live data cannot be modified. - */ - flags = 0; - memcpy(tmp, data, dsize); - data = tmp; - mtype_data_reset_flags(data, &flags); + /* We have readers running parallel with us, + * so the live data cannot be modified. + */ + flags = 0; + memcpy(tmp, data, dsize); + data = tmp; + mtype_data_reset_flags(data, &flags); #endif - key = HKEY(data, h->initval, htable_bits); - m = __ipset_dereference_protected(hbucket(t, key), 1); - if (!m) { - m = kzalloc(sizeof(*m) + + key = HKEY(data, h->initval, htable_bits); + m = __ipset_dereference(hbucket(t, key)); + nr = ahash_region(key, htable_bits); + if (!m) { + m = kzalloc(sizeof(*m) + AHASH_INIT_SIZE * dsize, GFP_ATOMIC); - if (!m) { - ret = -ENOMEM; - goto cleanup; - } - m->size = AHASH_INIT_SIZE; - extsize += ext_size(AHASH_INIT_SIZE, dsize); - RCU_INIT_POINTER(hbucket(t, key), m); - } else if (m->pos >= m->size) { - struct hbucket *ht; - - if (m->size >= AHASH_MAX(h)) { - ret = -EAGAIN; - } else { - ht = kzalloc(sizeof(*ht) + + if (!m) { + ret = -ENOMEM; + goto cleanup; + } + m->size = AHASH_INIT_SIZE; + t->hregion[nr].ext_size += + ext_size(AHASH_INIT_SIZE, + dsize); + RCU_INIT_POINTER(hbucket(t, key), m); + } else if (m->pos >= m->size) { + struct hbucket *ht; + + if (m->size >= AHASH_MAX(h)) { + ret = -EAGAIN; + } else { + ht = kzalloc(sizeof(*ht) + (m->size + AHASH_INIT_SIZE) * dsize, GFP_ATOMIC); - if (!ht) - ret = -ENOMEM; + if (!ht) + ret = -ENOMEM; + } + if (ret < 0) + goto cleanup; + memcpy(ht, m, sizeof(struct hbucket) + + m->size * dsize); + ht->size = m->size + AHASH_INIT_SIZE; + t->hregion[nr].ext_size += + ext_size(AHASH_INIT_SIZE, + dsize); + kfree(m); + m = ht; + RCU_INIT_POINTER(hbucket(t, key), ht); } - if (ret < 0) - goto cleanup; - memcpy(ht, m, sizeof(struct hbucket) + - m->size * dsize); - ht->size = m->size + AHASH_INIT_SIZE; - extsize += ext_size(AHASH_INIT_SIZE, dsize); - kfree(m); - m = ht; - RCU_INIT_POINTER(hbucket(t, key), ht); - } - d = ahash_data(m, m->pos, dsize); - memcpy(d, data, dsize); - set_bit(m->pos++, m->used); + d = ahash_data(m, m->pos, dsize); + memcpy(d, data, dsize); + set_bit(m->pos++, m->used); + t->hregion[nr].elements++; #ifdef IP_SET_HASH_WITH_NETS - mtype_data_reset_flags(d, &flags); + mtype_data_reset_flags(d, &flags); #endif + } } + rcu_read_unlock_bh(); } - rcu_assign_pointer(h->table, t); - set->ext_size = extsize;
- spin_unlock_bh(&set->lock); + /* There can't be any other writer. */ + rcu_assign_pointer(h->table, t);
/* Give time to other readers of the set */ synchronize_rcu();
pr_debug("set %s resized from %u (%p) to %u (%p)\n", set->name, orig->htable_bits, orig, t->htable_bits, t); - /* If there's nobody else dumping the table, destroy it */ + /* Add/delete elements processed by the SET target during resize. + * Kernel-side add cannot trigger a resize and userspace actions + * are serialized by the mutex. + */ + list_for_each_safe(l, lt, &h->ad) { + x = list_entry(l, struct mtype_resize_ad, list); + if (x->ad == IPSET_ADD) { + mtype_add(set, &x->d, &x->ext, &x->mext, x->flags); + } else { + mtype_del(set, &x->d, NULL, NULL, 0); + } + list_del(l); + kfree(l); + } + /* If there's nobody else using the table, destroy it */ if (atomic_dec_and_test(&orig->uref)) { pr_debug("Table destroy by resize %p\n", orig); mtype_ahash_destroy(set, orig, false); @@ -677,15 +810,44 @@ mtype_resize(struct ip_set *set, bool retried) return ret;
cleanup: + rcu_read_unlock_bh(); atomic_set(&orig->ref, 0); atomic_dec(&orig->uref); - spin_unlock_bh(&set->lock); mtype_ahash_destroy(set, t, false); if (ret == -EAGAIN) goto retry; goto out; }
+/* Get the current number of elements and ext_size in the set */ +static void +mtype_ext_size(struct ip_set *set, u32 *elements, size_t *ext_size) +{ + struct htype *h = set->data; + const struct htable *t; + u32 i, j, r; + struct hbucket *n; + struct mtype_elem *data; + + t = rcu_dereference_bh(h->table); + for (r = 0; r < ahash_numof_locks(t->htable_bits); r++) { + for (i = ahash_bucket_start(r, t->htable_bits); + i < ahash_bucket_end(r, t->htable_bits); i++) { + n = rcu_dereference_bh(hbucket(t, i)); + if (!n) + continue; + for (j = 0; j < n->pos; j++) { + if (!test_bit(j, n->used)) + continue; + data = ahash_data(n, j, set->dsize); + if (!SET_ELEM_EXPIRED(set, data)) + (*elements)++; + } + } + *ext_size += t->hregion[r].ext_size; + } +} + /* Add an element to a hash and update the internal counters when succeeded, * otherwise report the proper error code. */ @@ -698,32 +860,49 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, const struct mtype_elem *d = value; struct mtype_elem *data; struct hbucket *n, *old = ERR_PTR(-ENOENT); - int i, j = -1; + int i, j = -1, ret; bool flag_exist = flags & IPSET_FLAG_EXIST; bool deleted = false, forceadd = false, reuse = false; - u32 key, multi = 0; + u32 r, key, multi = 0, elements, maxelem;
- if (set->elements >= h->maxelem) { - if (SET_WITH_TIMEOUT(set)) - /* FIXME: when set is full, we slow down here */ - mtype_expire(set, h); - if (set->elements >= h->maxelem && SET_WITH_FORCEADD(set)) + rcu_read_lock_bh(); + t = rcu_dereference_bh(h->table); + key = HKEY(value, h->initval, t->htable_bits); + r = ahash_region(key, t->htable_bits); + atomic_inc(&t->uref); + elements = t->hregion[r].elements; + maxelem = t->maxelem; + if (elements >= maxelem) { + u32 e; + if (SET_WITH_TIMEOUT(set)) { + rcu_read_unlock_bh(); + mtype_gc_do(set, h, t, r); + rcu_read_lock_bh(); + } + maxelem = h->maxelem; + elements = 0; + for (e = 0; e < ahash_numof_locks(t->htable_bits); e++) + elements += t->hregion[e].elements; + if (elements >= maxelem && SET_WITH_FORCEADD(set)) forceadd = true; } + rcu_read_unlock_bh();
- t = ipset_dereference_protected(h->table, set); - key = HKEY(value, h->initval, t->htable_bits); - n = __ipset_dereference_protected(hbucket(t, key), 1); + spin_lock_bh(&t->hregion[r].lock); + n = rcu_dereference_bh(hbucket(t, key)); if (!n) { - if (forceadd || set->elements >= h->maxelem) + if (forceadd || elements >= maxelem) goto set_full; old = NULL; n = kzalloc(sizeof(*n) + AHASH_INIT_SIZE * set->dsize, GFP_ATOMIC); - if (!n) - return -ENOMEM; + if (!n) { + ret = -ENOMEM; + goto unlock; + } n->size = AHASH_INIT_SIZE; - set->ext_size += ext_size(AHASH_INIT_SIZE, set->dsize); + t->hregion[r].ext_size += + ext_size(AHASH_INIT_SIZE, set->dsize); goto copy_elem; } for (i = 0; i < n->pos; i++) { @@ -737,19 +916,16 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, } data = ahash_data(n, i, set->dsize); if (mtype_data_equal(data, d, &multi)) { - if (flag_exist || - (SET_WITH_TIMEOUT(set) && - ip_set_timeout_expired(ext_timeout(data, set)))) { + if (flag_exist || SET_ELEM_EXPIRED(set, data)) { /* Just the extensions could be overwritten */ j = i; goto overwrite_extensions; } - return -IPSET_ERR_EXIST; + ret = -IPSET_ERR_EXIST; + goto unlock; } /* Reuse first timed out entry */ - if (SET_WITH_TIMEOUT(set) && - ip_set_timeout_expired(ext_timeout(data, set)) && - j == -1) { + if (SET_ELEM_EXPIRED(set, data) && j == -1) { j = i; reuse = true; } @@ -759,16 +935,16 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, if (!deleted) { #ifdef IP_SET_HASH_WITH_NETS for (i = 0; i < IPSET_NET_COUNT; i++) - mtype_del_cidr(h, + mtype_del_cidr(set, h, NCIDR_PUT(DCIDR_GET(data->cidr, i)), i); #endif ip_set_ext_destroy(set, data); - set->elements--; + t->hregion[r].elements--; } goto copy_data; } - if (set->elements >= h->maxelem) + if (elements >= maxelem) goto set_full; /* Create a new slot */ if (n->pos >= n->size) { @@ -776,28 +952,32 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, if (n->size >= AHASH_MAX(h)) { /* Trigger rehashing */ mtype_data_next(&h->next, d); - return -EAGAIN; + ret = -EAGAIN; + goto resize; } old = n; n = kzalloc(sizeof(*n) + (old->size + AHASH_INIT_SIZE) * set->dsize, GFP_ATOMIC); - if (!n) - return -ENOMEM; + if (!n) { + ret = -ENOMEM; + goto unlock; + } memcpy(n, old, sizeof(struct hbucket) + old->size * set->dsize); n->size = old->size + AHASH_INIT_SIZE; - set->ext_size += ext_size(AHASH_INIT_SIZE, set->dsize); + t->hregion[r].ext_size += + ext_size(AHASH_INIT_SIZE, set->dsize); }
copy_elem: j = n->pos++; data = ahash_data(n, j, set->dsize); copy_data: - set->elements++; + t->hregion[r].elements++; #ifdef IP_SET_HASH_WITH_NETS for (i = 0; i < IPSET_NET_COUNT; i++) - mtype_add_cidr(h, NCIDR_PUT(DCIDR_GET(d->cidr, i)), i); + mtype_add_cidr(set, h, NCIDR_PUT(DCIDR_GET(d->cidr, i)), i); #endif memcpy(data, d, sizeof(struct mtype_elem)); overwrite_extensions: @@ -820,13 +1000,41 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, if (old) kfree_rcu(old, rcu); } + ret = 0; +resize: + spin_unlock_bh(&t->hregion[r].lock); + if (atomic_read(&t->ref) && ext->target) { + /* Resize is in process and kernel side add, save values */ + struct mtype_resize_ad *x; + + x = kzalloc(sizeof(struct mtype_resize_ad), GFP_ATOMIC); + if (!x) + /* Don't bother */ + goto out; + x->ad = IPSET_ADD; + memcpy(&x->d, value, sizeof(struct mtype_elem)); + memcpy(&x->ext, ext, sizeof(struct ip_set_ext)); + memcpy(&x->mext, mext, sizeof(struct ip_set_ext)); + x->flags = flags; + spin_lock_bh(&set->lock); + list_add_tail(&x->list, &h->ad); + spin_unlock_bh(&set->lock); + } + goto out;
- return 0; set_full: if (net_ratelimit()) pr_warn("Set %s is full, maxelem %u reached\n", - set->name, h->maxelem); - return -IPSET_ERR_HASH_FULL; + set->name, maxelem); + ret = -IPSET_ERR_HASH_FULL; +unlock: + spin_unlock_bh(&t->hregion[r].lock); +out: + if (atomic_dec_and_test(&t->uref) && atomic_read(&t->ref)) { + pr_debug("Table destroy after resize by add: %p\n", t); + mtype_ahash_destroy(set, t, false); + } + return ret; }
/* Delete an element from the hash and free up space if possible. @@ -840,13 +1048,23 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext, const struct mtype_elem *d = value; struct mtype_elem *data; struct hbucket *n; - int i, j, k, ret = -IPSET_ERR_EXIST; + struct mtype_resize_ad *x = NULL; + int i, j, k, r, ret = -IPSET_ERR_EXIST; u32 key, multi = 0; size_t dsize = set->dsize;
- t = ipset_dereference_protected(h->table, set); + /* Userspace add and resize is excluded by the mutex. + * Kernespace add does not trigger resize. + */ + rcu_read_lock_bh(); + t = rcu_dereference_bh(h->table); key = HKEY(value, h->initval, t->htable_bits); - n = __ipset_dereference_protected(hbucket(t, key), 1); + r = ahash_region(key, t->htable_bits); + atomic_inc(&t->uref); + rcu_read_unlock_bh(); + + spin_lock_bh(&t->hregion[r].lock); + n = rcu_dereference_bh(hbucket(t, key)); if (!n) goto out; for (i = 0, k = 0; i < n->pos; i++) { @@ -857,8 +1075,7 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext, data = ahash_data(n, i, dsize); if (!mtype_data_equal(data, d, &multi)) continue; - if (SET_WITH_TIMEOUT(set) && - ip_set_timeout_expired(ext_timeout(data, set))) + if (SET_ELEM_EXPIRED(set, data)) goto out;
ret = 0; @@ -866,20 +1083,33 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext, smp_mb__after_atomic(); if (i + 1 == n->pos) n->pos--; - set->elements--; + t->hregion[r].elements--; #ifdef IP_SET_HASH_WITH_NETS for (j = 0; j < IPSET_NET_COUNT; j++) - mtype_del_cidr(h, NCIDR_PUT(DCIDR_GET(d->cidr, j)), - j); + mtype_del_cidr(set, h, + NCIDR_PUT(DCIDR_GET(d->cidr, j)), j); #endif ip_set_ext_destroy(set, data);
+ if (atomic_read(&t->ref) && ext->target) { + /* Resize is in process and kernel side del, + * save values + */ + x = kzalloc(sizeof(struct mtype_resize_ad), + GFP_ATOMIC); + if (x) { + x->ad = IPSET_DEL; + memcpy(&x->d, value, + sizeof(struct mtype_elem)); + x->flags = flags; + } + } for (; i < n->pos; i++) { if (!test_bit(i, n->used)) k++; } if (n->pos == 0 && k == 0) { - set->ext_size -= ext_size(n->size, dsize); + t->hregion[r].ext_size -= ext_size(n->size, dsize); rcu_assign_pointer(hbucket(t, key), NULL); kfree_rcu(n, rcu); } else if (k >= AHASH_INIT_SIZE) { @@ -898,7 +1128,8 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext, k++; } tmp->pos = k; - set->ext_size -= ext_size(AHASH_INIT_SIZE, dsize); + t->hregion[r].ext_size -= + ext_size(AHASH_INIT_SIZE, dsize); rcu_assign_pointer(hbucket(t, key), tmp); kfree_rcu(n, rcu); } @@ -906,6 +1137,16 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext, }
out: + spin_unlock_bh(&t->hregion[r].lock); + if (x) { + spin_lock_bh(&set->lock); + list_add(&x->list, &h->ad); + spin_unlock_bh(&set->lock); + } + if (atomic_dec_and_test(&t->uref) && atomic_read(&t->ref)) { + pr_debug("Table destroy after resize by del: %p\n", t); + mtype_ahash_destroy(set, t, false); + } return ret; }
@@ -991,6 +1232,7 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext, int i, ret = 0; u32 key, multi = 0;
+ rcu_read_lock_bh(); t = rcu_dereference_bh(h->table); #ifdef IP_SET_HASH_WITH_NETS /* If we test an IP address and not a network address, @@ -1022,6 +1264,7 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext, goto out; } out: + rcu_read_unlock_bh(); return ret; }
@@ -1033,23 +1276,14 @@ mtype_head(struct ip_set *set, struct sk_buff *skb) const struct htable *t; struct nlattr *nested; size_t memsize; + u32 elements = 0; + size_t ext_size = 0; u8 htable_bits;
- /* If any members have expired, set->elements will be wrong - * mytype_expire function will update it with the right count. - * we do not hold set->lock here, so grab it first. - * set->elements can still be incorrect in the case of a huge set, - * because elements might time out during the listing. - */ - if (SET_WITH_TIMEOUT(set)) { - spin_lock_bh(&set->lock); - mtype_expire(set, h); - spin_unlock_bh(&set->lock); - } - rcu_read_lock_bh(); - t = rcu_dereference_bh_nfnl(h->table); - memsize = mtype_ahash_memsize(h, t) + set->ext_size; + t = rcu_dereference_bh(h->table); + mtype_ext_size(set, &elements, &ext_size); + memsize = mtype_ahash_memsize(h, t) + ext_size + set->ext_size; htable_bits = t->htable_bits; rcu_read_unlock_bh();
@@ -1071,7 +1305,7 @@ mtype_head(struct ip_set *set, struct sk_buff *skb) #endif if (nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref)) || nla_put_net32(skb, IPSET_ATTR_MEMSIZE, htonl(memsize)) || - nla_put_net32(skb, IPSET_ATTR_ELEMENTS, htonl(set->elements))) + nla_put_net32(skb, IPSET_ATTR_ELEMENTS, htonl(elements))) goto nla_put_failure; if (unlikely(ip_set_put_flags(skb, set))) goto nla_put_failure; @@ -1091,15 +1325,15 @@ mtype_uref(struct ip_set *set, struct netlink_callback *cb, bool start)
if (start) { rcu_read_lock_bh(); - t = rcu_dereference_bh_nfnl(h->table); + t = ipset_dereference_bh_nfnl(h->table); atomic_inc(&t->uref); cb->args[IPSET_CB_PRIVATE] = (unsigned long)t; rcu_read_unlock_bh(); } else if (cb->args[IPSET_CB_PRIVATE]) { t = (struct htable *)cb->args[IPSET_CB_PRIVATE]; if (atomic_dec_and_test(&t->uref) && atomic_read(&t->ref)) { - /* Resizing didn't destroy the hash table */ - pr_debug("Table destroy by dump: %p\n", t); + pr_debug("Table destroy after resize " + " by dump: %p\n", t); mtype_ahash_destroy(set, t, false); } cb->args[IPSET_CB_PRIVATE] = 0; @@ -1141,8 +1375,7 @@ mtype_list(const struct ip_set *set, if (!test_bit(i, n->used)) continue; e = ahash_data(n, i, set->dsize); - if (SET_WITH_TIMEOUT(set) && - ip_set_timeout_expired(ext_timeout(e, set))) + if (SET_ELEM_EXPIRED(set, e)) continue; pr_debug("list hash %lu hbucket %p i %u, data %p\n", cb->args[IPSET_CB_ARG0], n, i, e); @@ -1208,6 +1441,7 @@ static const struct ip_set_type_variant mtype_variant = { .uref = mtype_uref, .resize = mtype_resize, .same_set = mtype_same_set, + .region_lock = true, };
#ifdef IP_SET_EMIT_CREATE @@ -1226,6 +1460,7 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set, size_t hsize; struct htype *h; struct htable *t; + u32 i;
pr_debug("Create set %s with family %s\n", set->name, set->family == NFPROTO_IPV4 ? "inet" : "inet6"); @@ -1294,6 +1529,15 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set, kfree(h); return -ENOMEM; } + t->hregion = ip_set_alloc(ahash_sizeof_regions(hbits)); + if (!t->hregion) { + kfree(t); + kfree(h); + return -ENOMEM; + } + h->gc.set = set; + for (i = 0; i < ahash_numof_locks(hbits); i++) + spin_lock_init(&t->hregion[i].lock); h->maxelem = maxelem; #ifdef IP_SET_HASH_WITH_NETMASK h->netmask = netmask; @@ -1304,9 +1548,10 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set, get_random_bytes(&h->initval, sizeof(h->initval));
t->htable_bits = hbits; + t->maxelem = h->maxelem / ahash_numof_locks(hbits); RCU_INIT_POINTER(h->table, t);
- h->set = set; + INIT_LIST_HEAD(&h->ad); set->data = h; #ifndef IP_SET_PROTO_UNDEF if (set->family == NFPROTO_IPV4) { @@ -1329,12 +1574,10 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set, #ifndef IP_SET_PROTO_UNDEF if (set->family == NFPROTO_IPV4) #endif - IPSET_TOKEN(HTYPE, 4_gc_init)(set, - IPSET_TOKEN(HTYPE, 4_gc)); + IPSET_TOKEN(HTYPE, 4_gc_init)(&h->gc); #ifndef IP_SET_PROTO_UNDEF else - IPSET_TOKEN(HTYPE, 6_gc_init)(set, - IPSET_TOKEN(HTYPE, 6_gc)); + IPSET_TOKEN(HTYPE, 6_gc_init)(&h->gc); #endif } pr_debug("create %s hashsize %u (%u) maxelem %u: %p(%p)\n",
From: Jozsef Kadlecsik kadlec@netfilter.org
[ Upstream commit 8af1c6fbd9239877998c7f5a591cb2c88d41fb66 ]
When the forceadd option is enabled, the hash:* types should find and replace the first entry in the bucket with the new one if there are no reuseable (deleted or timed out) entries. However, the position index was just not set to zero and remained the invalid -1 if there were no reuseable entries.
Reported-by: syzbot+6a86565c74ebe30aea18@syzkaller.appspotmail.com Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and protocol version 7") Signed-off-by: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/ipset/ip_set_hash_gen.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h index 71e93eac08319..e52d7b7597a0d 100644 --- a/net/netfilter/ipset/ip_set_hash_gen.h +++ b/net/netfilter/ipset/ip_set_hash_gen.h @@ -931,6 +931,8 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext, } } if (reuse || forceadd) { + if (j == -1) + j = 0; data = ahash_data(n, j, set->dsize); if (!deleted) { #ifdef IP_SET_HASH_WITH_NETS
From: Eugenio Pérez eperezma@redhat.com
[ Upstream commit 42d84c8490f9f0931786f1623191fcab397c3d64 ]
Doing so, we save one call to get data we already have in the struct.
Also, since there is no guarantee that getname use sockaddr_ll parameter beyond its size, we add a little bit of security here. It should do not do beyond MAX_ADDR_LEN, but syzbot found that ax25_getname writes more (72 bytes, the size of full_sockaddr_ax25, versus 20 + 32 bytes of sockaddr_ll + MAX_ADDR_LEN in syzbot repro).
Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server") Reported-by: syzbot+f2a62d07a5198c819c7b@syzkaller.appspotmail.com Signed-off-by: Eugenio Pérez eperezma@redhat.com Acked-by: Michael S. Tsirkin mst@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/vhost/net.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index e158159671fa2..18e205eeb9af7 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1414,10 +1414,6 @@ static int vhost_net_release(struct inode *inode, struct file *f)
static struct socket *get_raw_socket(int fd) { - struct { - struct sockaddr_ll sa; - char buf[MAX_ADDR_LEN]; - } uaddr; int r; struct socket *sock = sockfd_lookup(fd, &r);
@@ -1430,11 +1426,7 @@ static struct socket *get_raw_socket(int fd) goto err; }
- r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa, 0); - if (r < 0) - goto err; - - if (uaddr.sa.sll_family != AF_PACKET) { + if (sock->sk->sk_family != AF_PACKET) { r = -EPFNOSUPPORT; goto err; }
From: Daniele Palmas dnlplm@gmail.com
[ Upstream commit eae7172f8141eb98e64e6e81acc9e9d5b2add127 ]
usbnet creates network interfaces with min_mtu = 0 and max_mtu = ETH_MAX_MTU.
These values are not modified by qmi_wwan when the network interface is created initially, allowing, for example, to set mtu greater than 1500.
When a raw_ip switch is done (raw_ip set to 'Y', then set to 'N') the mtu values for the network interface are set through ether_setup, with min_mtu = ETH_MIN_MTU and max_mtu = ETH_DATA_LEN, not allowing anymore to set mtu greater than 1500 (error: mtu greater than device maximum).
The patch restores the original min/max mtu values set by usbnet after a raw_ip switch.
Signed-off-by: Daniele Palmas dnlplm@gmail.com Acked-by: Bjørn Mork bjorn@mork.no Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/usb/qmi_wwan.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index 9485c8d1de8a3..9dcbca1f8f9aa 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -338,6 +338,9 @@ static void qmi_wwan_netdev_setup(struct net_device *net) netdev_dbg(net, "mode: raw IP\n"); } else if (!net->header_ops) { /* don't bother if already set */ ether_setup(net); + /* Restoring min/max mtu values set originally by usbnet */ + net->min_mtu = 0; + net->max_mtu = ETH_MAX_MTU; clear_bit(EVENT_NO_IP_ALIGN, &dev->flags); netdev_dbg(net, "mode: Ethernet\n"); }
From: Haiyang Zhang haiyangz@microsoft.com
[ Upstream commit f6f13c125e05603f68f5bf31f045b95e6d493598 ]
When netvsc_attach() is called by operations like changing MTU, etc., an extra wakeup may happen while netvsc_attach() calling rndis_filter_device_add() which sends rndis messages when queue is stopped in netvsc_detach(). The completion message will wake up queue 0.
We can reproduce the issue by changing MTU etc., then the wake_queue counter from "ethtool -S" will increase beyond stop_queue counter: stop_queue: 0 wake_queue: 1 The issue causes queue wake up, and counter increment, no other ill effects in current code. So we didn't see any network problem for now.
To fix this, initialize tx_disable to true, and set it to false when the NIC is ready to be attached or registered.
Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic") Signed-off-by: Haiyang Zhang haiyangz@microsoft.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/hyperv/netvsc.c | 2 +- drivers/net/hyperv/netvsc_drv.c | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index eab83e71567a9..6c0732fc8c250 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -99,7 +99,7 @@ static struct netvsc_device *alloc_net_device(void)
init_waitqueue_head(&net_device->wait_drain); net_device->destroy = false; - net_device->tx_disable = false; + net_device->tx_disable = true;
net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT; net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index f3f9eb8a402a2..ee1ad7ae75550 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -977,6 +977,7 @@ static int netvsc_attach(struct net_device *ndev, }
/* In any case device is now ready */ + nvdev->tx_disable = false; netif_device_attach(ndev);
/* Note: enable and attach happen when sub-channels setup */ @@ -2354,6 +2355,8 @@ static int netvsc_probe(struct hv_device *dev, else net->max_mtu = ETH_DATA_LEN;
+ nvdev->tx_disable = false; + ret = register_netdevice(net); if (ret != 0) { pr_err("Unable to register netdev.\n");
From: Marek Vasut marex@denx.de
[ Upstream commit 44343418d0f2f623cb9da6f5000df793131cbe3b ]
The KS8851 requires that packet RX and TX are mutually exclusive. Currently, the driver hopes to achieve this by disabling interrupt from the card by writing the card registers and by disabling the interrupt on the interrupt controller. This however is racy on SMP.
Replace this approach by expanding the spinlock used around the ks_start_xmit() TX path to ks_irq() RX path to assure true mutual exclusion and remove the interrupt enabling/disabling, which is now not needed anymore. Furthermore, disable interrupts also in ks_net_stop(), which was missing before.
Note that a massive improvement here would be to re-use the KS8851 driver approach, which is to move the TX path into a worker thread, interrupt handling to threaded interrupt, and synchronize everything with mutexes, but that would be a much bigger rework, for a separate patch.
Signed-off-by: Marek Vasut marex@denx.de Cc: David S. Miller davem@davemloft.net Cc: Lukas Wunner lukas@wunner.de Cc: Petr Stetiar ynezz@true.cz Cc: YueHaibing yuehaibing@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/micrel/ks8851_mll.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/micrel/ks8851_mll.c b/drivers/net/ethernet/micrel/ks8851_mll.c index a41a90c589db2..20cb5b500661a 100644 --- a/drivers/net/ethernet/micrel/ks8851_mll.c +++ b/drivers/net/ethernet/micrel/ks8851_mll.c @@ -548,14 +548,17 @@ static irqreturn_t ks_irq(int irq, void *pw) { struct net_device *netdev = pw; struct ks_net *ks = netdev_priv(netdev); + unsigned long flags; u16 status;
+ spin_lock_irqsave(&ks->statelock, flags); /*this should be the first in IRQ handler */ ks_save_cmd_reg(ks);
status = ks_rdreg16(ks, KS_ISR); if (unlikely(!status)) { ks_restore_cmd_reg(ks); + spin_unlock_irqrestore(&ks->statelock, flags); return IRQ_NONE; }
@@ -581,6 +584,7 @@ static irqreturn_t ks_irq(int irq, void *pw) ks->netdev->stats.rx_over_errors++; /* this should be the last in IRQ handler*/ ks_restore_cmd_reg(ks); + spin_unlock_irqrestore(&ks->statelock, flags); return IRQ_HANDLED; }
@@ -650,6 +654,7 @@ static int ks_net_stop(struct net_device *netdev)
/* shutdown RX/TX QMU */ ks_disable_qmu(ks); + ks_disable_int(ks);
/* set powermode to soft power down to save power */ ks_set_powermode(ks, PMECR_PM_SOFTDOWN); @@ -706,10 +711,9 @@ static netdev_tx_t ks_start_xmit(struct sk_buff *skb, struct net_device *netdev) { netdev_tx_t retv = NETDEV_TX_OK; struct ks_net *ks = netdev_priv(netdev); + unsigned long flags;
- disable_irq(netdev->irq); - ks_disable_int(ks); - spin_lock(&ks->statelock); + spin_lock_irqsave(&ks->statelock, flags);
/* Extra space are required: * 4 byte for alignment, 4 for status/length, 4 for CRC @@ -723,9 +727,7 @@ static netdev_tx_t ks_start_xmit(struct sk_buff *skb, struct net_device *netdev) dev_kfree_skb(skb); } else retv = NETDEV_TX_BUSY; - spin_unlock(&ks->statelock); - ks_enable_int(ks); - enable_irq(netdev->irq); + spin_unlock_irqrestore(&ks->statelock, flags); return retv; }
From: Madhuparna Bhowmik madhuparnabhowmik10@gmail.com
[ Upstream commit 253216ffb2a002a682c6f68bd3adff5b98b71de8 ]
local->sta_mtx is held in __ieee80211_check_fast_rx_iface(). No need to use list_for_each_entry_rcu() as it also requires a cond argument to avoid false lockdep warnings when not used in RCU read-side section (with CONFIG_PROVE_RCU_LIST). Therefore use list_for_each_entry();
Signed-off-by: Madhuparna Bhowmik madhuparnabhowmik10@gmail.com Link: https://lore.kernel.org/r/20200223143302.15390-1-madhuparnabhowmik10@gmail.c... Signed-off-by: Johannes Berg johannes.berg@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/mac80211/rx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c index 0e05ff0376726..0ba98ad9bc854 100644 --- a/net/mac80211/rx.c +++ b/net/mac80211/rx.c @@ -4114,7 +4114,7 @@ void __ieee80211_check_fast_rx_iface(struct ieee80211_sub_if_data *sdata)
lockdep_assert_held(&local->sta_mtx);
- list_for_each_entry_rcu(sta, &local->sta_list, list) { + list_for_each_entry(sta, &local->sta_list, list) { if (sdata != sta->sdata && (!sta->sdata->bss || sta->sdata->bss != sdata->bss)) continue;
From: Esben Haabendal esben@geanix.com
[ Upstream commit 84823ff80f7403752b59e00bb198724100dc611c ]
It is possible that the interrupt handler fires and frees up space in the TX ring in between checking for sufficient TX ring space and stopping the TX queue in temac_start_xmit. If this happens, the queue wake from the interrupt handler will occur before the queue is stopped, causing a lost wakeup and the adapter's transmit hanging.
To avoid this, after stopping the queue, check again whether there is sufficient space in the TX ring. If so, wake up the queue again.
This is a port of the similar fix in axienet driver, commit 7de44285c1f6 ("net: axienet: Fix race condition causing TX hang").
Fixes: 23ecc4bde21f ("net: ll_temac: fix checksum offload logic") Signed-off-by: Esben Haabendal esben@geanix.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/xilinx/ll_temac_main.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/xilinx/ll_temac_main.c b/drivers/net/ethernet/xilinx/ll_temac_main.c index 21c1b4322ea78..fd578568b3bff 100644 --- a/drivers/net/ethernet/xilinx/ll_temac_main.c +++ b/drivers/net/ethernet/xilinx/ll_temac_main.c @@ -788,6 +788,9 @@ static void temac_start_xmit_done(struct net_device *ndev) stat = be32_to_cpu(cur_p->app0); }
+ /* Matches barrier in temac_start_xmit */ + smp_mb(); + netif_wake_queue(ndev); }
@@ -830,9 +833,19 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) cur_p = &lp->tx_bd_v[lp->tx_bd_tail];
if (temac_check_tx_bd_space(lp, num_frag + 1)) { - if (!netif_queue_stopped(ndev)) - netif_stop_queue(ndev); - return NETDEV_TX_BUSY; + if (netif_queue_stopped(ndev)) + return NETDEV_TX_BUSY; + + netif_stop_queue(ndev); + + /* Matches barrier in temac_start_xmit_done */ + smp_mb(); + + /* Space might have just been freed - check again */ + if (temac_check_tx_bd_space(lp, num_frag)) + return NETDEV_TX_BUSY; + + netif_wake_queue(ndev); }
cur_p->app0 = 0;
From: Esben Haabendal esben@geanix.com
[ Upstream commit d07c849cd2b97d6809430dfb7e738ad31088037a ]
This adds error handling to the remaining dma_map_single() calls, so that behavior is well defined if/when we run out of DMA memory.
Fixes: 92744989533c ("net: add Xilinx ll_temac device driver") Signed-off-by: Esben Haabendal esben@geanix.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/xilinx/ll_temac_main.c | 26 +++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/xilinx/ll_temac_main.c b/drivers/net/ethernet/xilinx/ll_temac_main.c index fd578568b3bff..fd4231493449b 100644 --- a/drivers/net/ethernet/xilinx/ll_temac_main.c +++ b/drivers/net/ethernet/xilinx/ll_temac_main.c @@ -367,6 +367,8 @@ static int temac_dma_bd_init(struct net_device *ndev) skb_dma_addr = dma_map_single(ndev->dev.parent, skb->data, XTE_MAX_JUMBO_FRAME_SIZE, DMA_FROM_DEVICE); + if (dma_mapping_error(ndev->dev.parent, skb_dma_addr)) + goto out; lp->rx_bd_v[i].phys = cpu_to_be32(skb_dma_addr); lp->rx_bd_v[i].len = cpu_to_be32(XTE_MAX_JUMBO_FRAME_SIZE); lp->rx_bd_v[i].app0 = cpu_to_be32(STS_CTRL_APP0_IRQONEND); @@ -863,12 +865,13 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) skb_dma_addr = dma_map_single(ndev->dev.parent, skb->data, skb_headlen(skb), DMA_TO_DEVICE); cur_p->len = cpu_to_be32(skb_headlen(skb)); + if (WARN_ON_ONCE(dma_mapping_error(ndev->dev.parent, skb_dma_addr))) + return NETDEV_TX_BUSY; cur_p->phys = cpu_to_be32(skb_dma_addr); ptr_to_txbd((void *)skb, cur_p);
for (ii = 0; ii < num_frag; ii++) { - lp->tx_bd_tail++; - if (lp->tx_bd_tail >= TX_BD_NUM) + if (++lp->tx_bd_tail >= TX_BD_NUM) lp->tx_bd_tail = 0;
cur_p = &lp->tx_bd_v[lp->tx_bd_tail]; @@ -876,6 +879,25 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) skb_frag_address(frag), skb_frag_size(frag), DMA_TO_DEVICE); + if (dma_mapping_error(ndev->dev.parent, skb_dma_addr)) { + if (--lp->tx_bd_tail < 0) + lp->tx_bd_tail = TX_BD_NUM - 1; + cur_p = &lp->tx_bd_v[lp->tx_bd_tail]; + while (--ii >= 0) { + --frag; + dma_unmap_single(ndev->dev.parent, + be32_to_cpu(cur_p->phys), + skb_frag_size(frag), + DMA_TO_DEVICE); + if (--lp->tx_bd_tail < 0) + lp->tx_bd_tail = TX_BD_NUM - 1; + cur_p = &lp->tx_bd_v[lp->tx_bd_tail]; + } + dma_unmap_single(ndev->dev.parent, + be32_to_cpu(cur_p->phys), + skb_headlen(skb), DMA_TO_DEVICE); + return NETDEV_TX_BUSY; + } cur_p->phys = cpu_to_be32(skb_dma_addr); cur_p->len = cpu_to_be32(skb_frag_size(frag)); cur_p->app0 = 0;
From: Esben Haabendal esben@geanix.com
[ Upstream commit 770d9c67974c4c71af4beb786dc43162ad2a15ba ]
Failures caused by GFP_ATOMIC memory pressure have been observed, and due to the missing error handling, results in kernel crash such as
[1876998.350133] kernel BUG at mm/slub.c:3952! [1876998.350141] invalid opcode: 0000 [#1] PREEMPT SMP PTI [1876998.350147] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.3.0-scnxt #1 [1876998.350150] Hardware name: N/A N/A/COMe-bIP2, BIOS CCR2R920 03/01/2017 [1876998.350160] RIP: 0010:kfree+0x1ca/0x220 [1876998.350164] Code: 85 db 74 49 48 8b 95 68 01 00 00 48 31 c2 48 89 10 e9 d7 fe ff ff 49 8b 04 24 a9 00 00 01 00 75 0b 49 8b 44 24 08 a8 01 75 02 <0f> 0b 49 8b 04 24 31 f6 a9 00 00 01 00 74 06 41 0f b6 74 24 5b [1876998.350172] RSP: 0018:ffffc900000f0df0 EFLAGS: 00010246 [1876998.350177] RAX: ffffea00027f0708 RBX: ffff888008d78000 RCX: 0000000000391372 [1876998.350181] RDX: 0000000000000000 RSI: ffffe8ffffd01400 RDI: ffff888008d78000 [1876998.350185] RBP: ffff8881185a5d00 R08: ffffc90000087dd8 R09: 000000000000280a [1876998.350189] R10: 0000000000000002 R11: 0000000000000000 R12: ffffea0000235e00 [1876998.350193] R13: ffff8881185438a0 R14: 0000000000000000 R15: ffff888118543870 [1876998.350198] FS: 0000000000000000(0000) GS:ffff88811f300000(0000) knlGS:0000000000000000 [1876998.350203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 s#1 Part1 [1876998.350206] CR2: 00007f8dac7b09f0 CR3: 000000011e20a006 CR4: 00000000001606e0 [1876998.350210] Call Trace: [1876998.350215] <IRQ> [1876998.350224] ? __netif_receive_skb_core+0x70a/0x920 [1876998.350229] kfree_skb+0x32/0xb0 [1876998.350234] __netif_receive_skb_core+0x70a/0x920 [1876998.350240] __netif_receive_skb_one_core+0x36/0x80 [1876998.350245] process_backlog+0x8b/0x150 [1876998.350250] net_rx_action+0xf7/0x340 [1876998.350255] __do_softirq+0x10f/0x353 [1876998.350262] irq_exit+0xb2/0xc0 [1876998.350265] do_IRQ+0x77/0xd0 [1876998.350271] common_interrupt+0xf/0xf [1876998.350274] </IRQ>
In order to handle such failures more graceful, this change splits the receive loop into one for consuming the received buffers, and one for allocating new buffers.
When GFP_ATOMIC allocations fail, the receive will continue with the buffers that is still there, and with the expectation that the allocations will succeed in a later call to receive.
Fixes: 92744989533c ("net: add Xilinx ll_temac device driver") Signed-off-by: Esben Haabendal esben@geanix.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/xilinx/ll_temac.h | 1 + drivers/net/ethernet/xilinx/ll_temac_main.c | 112 ++++++++++++++------ 2 files changed, 82 insertions(+), 31 deletions(-)
diff --git a/drivers/net/ethernet/xilinx/ll_temac.h b/drivers/net/ethernet/xilinx/ll_temac.h index 276292bca334d..99fe059e5c7f3 100644 --- a/drivers/net/ethernet/xilinx/ll_temac.h +++ b/drivers/net/ethernet/xilinx/ll_temac.h @@ -375,6 +375,7 @@ struct temac_local { int tx_bd_next; int tx_bd_tail; int rx_bd_ci; + int rx_bd_tail;
/* DMA channel control setup */ u32 tx_chnl_ctrl; diff --git a/drivers/net/ethernet/xilinx/ll_temac_main.c b/drivers/net/ethernet/xilinx/ll_temac_main.c index fd4231493449b..2e3f59dae586e 100644 --- a/drivers/net/ethernet/xilinx/ll_temac_main.c +++ b/drivers/net/ethernet/xilinx/ll_temac_main.c @@ -389,12 +389,13 @@ static int temac_dma_bd_init(struct net_device *ndev) lp->tx_bd_next = 0; lp->tx_bd_tail = 0; lp->rx_bd_ci = 0; + lp->rx_bd_tail = RX_BD_NUM - 1;
/* Enable RX DMA transfers */ wmb(); lp->dma_out(lp, RX_CURDESC_PTR, lp->rx_bd_p); lp->dma_out(lp, RX_TAILDESC_PTR, - lp->rx_bd_p + (sizeof(*lp->rx_bd_v) * (RX_BD_NUM - 1))); + lp->rx_bd_p + (sizeof(*lp->rx_bd_v) * lp->rx_bd_tail));
/* Prepare for TX DMA transfer */ lp->dma_out(lp, TX_CURDESC_PTR, lp->tx_bd_p); @@ -923,27 +924,41 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) static void ll_temac_recv(struct net_device *ndev) { struct temac_local *lp = netdev_priv(ndev); - struct sk_buff *skb, *new_skb; - unsigned int bdstat; - struct cdmac_bd *cur_p; - dma_addr_t tail_p, skb_dma_addr; - int length; unsigned long flags; + int rx_bd; + bool update_tail = false;
spin_lock_irqsave(&lp->rx_lock, flags);
- tail_p = lp->rx_bd_p + sizeof(*lp->rx_bd_v) * lp->rx_bd_ci; - cur_p = &lp->rx_bd_v[lp->rx_bd_ci]; - - bdstat = be32_to_cpu(cur_p->app0); - while ((bdstat & STS_CTRL_APP0_CMPLT)) { + /* Process all received buffers, passing them on network + * stack. After this, the buffer descriptors will be in an + * un-allocated stage, where no skb is allocated for it, and + * they are therefore not available for TEMAC/DMA. + */ + do { + struct cdmac_bd *bd = &lp->rx_bd_v[lp->rx_bd_ci]; + struct sk_buff *skb = lp->rx_skb[lp->rx_bd_ci]; + unsigned int bdstat = be32_to_cpu(bd->app0); + int length; + + /* While this should not normally happen, we can end + * here when GFP_ATOMIC allocations fail, and we + * therefore have un-allocated buffers. + */ + if (!skb) + break;
- skb = lp->rx_skb[lp->rx_bd_ci]; - length = be32_to_cpu(cur_p->app4) & 0x3FFF; + /* Loop over all completed buffer descriptors */ + if (!(bdstat & STS_CTRL_APP0_CMPLT)) + break;
- dma_unmap_single(ndev->dev.parent, be32_to_cpu(cur_p->phys), + dma_unmap_single(ndev->dev.parent, be32_to_cpu(bd->phys), XTE_MAX_JUMBO_FRAME_SIZE, DMA_FROM_DEVICE); + /* The buffer is not valid for DMA anymore */ + bd->phys = 0; + bd->len = 0;
+ length = be32_to_cpu(bd->app4) & 0x3FFF; skb_put(skb, length); skb->protocol = eth_type_trans(skb, ndev); skb_checksum_none_assert(skb); @@ -958,39 +973,74 @@ static void ll_temac_recv(struct net_device *ndev) * (back) for proper IP checksum byte order * (be16). */ - skb->csum = htons(be32_to_cpu(cur_p->app3) & 0xFFFF); + skb->csum = htons(be32_to_cpu(bd->app3) & 0xFFFF); skb->ip_summed = CHECKSUM_COMPLETE; }
if (!skb_defer_rx_timestamp(skb)) netif_rx(skb); + /* The skb buffer is now owned by network stack above */ + lp->rx_skb[lp->rx_bd_ci] = NULL;
ndev->stats.rx_packets++; ndev->stats.rx_bytes += length;
- new_skb = netdev_alloc_skb_ip_align(ndev, - XTE_MAX_JUMBO_FRAME_SIZE); - if (!new_skb) { - spin_unlock_irqrestore(&lp->rx_lock, flags); - return; + rx_bd = lp->rx_bd_ci; + if (++lp->rx_bd_ci >= RX_BD_NUM) + lp->rx_bd_ci = 0; + } while (rx_bd != lp->rx_bd_tail); + + /* Allocate new buffers for those buffer descriptors that were + * passed to network stack. Note that GFP_ATOMIC allocations + * can fail (e.g. when a larger burst of GFP_ATOMIC + * allocations occurs), so while we try to allocate all + * buffers in the same interrupt where they were processed, we + * continue with what we could get in case of allocation + * failure. Allocation of remaining buffers will be retried + * in following calls. + */ + while (1) { + struct sk_buff *skb; + struct cdmac_bd *bd; + dma_addr_t skb_dma_addr; + + rx_bd = lp->rx_bd_tail + 1; + if (rx_bd >= RX_BD_NUM) + rx_bd = 0; + bd = &lp->rx_bd_v[rx_bd]; + + if (bd->phys) + break; /* All skb's allocated */ + + skb = netdev_alloc_skb_ip_align(ndev, XTE_MAX_JUMBO_FRAME_SIZE); + if (!skb) { + dev_warn(&ndev->dev, "skb alloc failed\n"); + break; }
- cur_p->app0 = cpu_to_be32(STS_CTRL_APP0_IRQONEND); - skb_dma_addr = dma_map_single(ndev->dev.parent, new_skb->data, + skb_dma_addr = dma_map_single(ndev->dev.parent, skb->data, XTE_MAX_JUMBO_FRAME_SIZE, DMA_FROM_DEVICE); - cur_p->phys = cpu_to_be32(skb_dma_addr); - cur_p->len = cpu_to_be32(XTE_MAX_JUMBO_FRAME_SIZE); - lp->rx_skb[lp->rx_bd_ci] = new_skb; + if (WARN_ON_ONCE(dma_mapping_error(ndev->dev.parent, + skb_dma_addr))) { + dev_kfree_skb_any(skb); + break; + }
- lp->rx_bd_ci++; - if (lp->rx_bd_ci >= RX_BD_NUM) - lp->rx_bd_ci = 0; + bd->phys = cpu_to_be32(skb_dma_addr); + bd->len = cpu_to_be32(XTE_MAX_JUMBO_FRAME_SIZE); + bd->app0 = cpu_to_be32(STS_CTRL_APP0_IRQONEND); + lp->rx_skb[rx_bd] = skb; + + lp->rx_bd_tail = rx_bd; + update_tail = true; + }
- cur_p = &lp->rx_bd_v[lp->rx_bd_ci]; - bdstat = be32_to_cpu(cur_p->app0); + /* Move tail pointer when buffers have been allocated */ + if (update_tail) { + lp->dma_out(lp, RX_TAILDESC_PTR, + lp->rx_bd_p + sizeof(*lp->rx_bd_v) * lp->rx_bd_tail); } - lp->dma_out(lp, RX_TAILDESC_PTR, tail_p);
spin_unlock_irqrestore(&lp->rx_lock, flags); }
From: Esben Haabendal esben@geanix.com
[ Upstream commit 1d63b8d66d146deaaedbe16c80de105f685ea012 ]
The SDMA engine used by TEMAC halts operation when it has finished processing of the last buffer descriptor in the buffer ring. Unfortunately, no interrupt event is generated when this happens, so we need to setup another mechanism to make sure DMA operation is restarted when enough buffers have been added to the ring.
Fixes: 92744989533c ("net: add Xilinx ll_temac device driver") Signed-off-by: Esben Haabendal esben@geanix.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/xilinx/ll_temac.h | 3 ++ drivers/net/ethernet/xilinx/ll_temac_main.c | 58 +++++++++++++++++++-- 2 files changed, 56 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/xilinx/ll_temac.h b/drivers/net/ethernet/xilinx/ll_temac.h index 99fe059e5c7f3..53fb8141f1a67 100644 --- a/drivers/net/ethernet/xilinx/ll_temac.h +++ b/drivers/net/ethernet/xilinx/ll_temac.h @@ -380,6 +380,9 @@ struct temac_local { /* DMA channel control setup */ u32 tx_chnl_ctrl; u32 rx_chnl_ctrl; + u8 coalesce_count_rx; + + struct delayed_work restart_work; };
/* Wrappers for temac_ior()/temac_iow() function pointers above */ diff --git a/drivers/net/ethernet/xilinx/ll_temac_main.c b/drivers/net/ethernet/xilinx/ll_temac_main.c index 2e3f59dae586e..eb480204cdbeb 100644 --- a/drivers/net/ethernet/xilinx/ll_temac_main.c +++ b/drivers/net/ethernet/xilinx/ll_temac_main.c @@ -51,6 +51,7 @@ #include <linux/ip.h> #include <linux/slab.h> #include <linux/interrupt.h> +#include <linux/workqueue.h> #include <linux/dma-mapping.h> #include <linux/processor.h> #include <linux/platform_data/xilinx-ll-temac.h> @@ -866,8 +867,11 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) skb_dma_addr = dma_map_single(ndev->dev.parent, skb->data, skb_headlen(skb), DMA_TO_DEVICE); cur_p->len = cpu_to_be32(skb_headlen(skb)); - if (WARN_ON_ONCE(dma_mapping_error(ndev->dev.parent, skb_dma_addr))) - return NETDEV_TX_BUSY; + if (WARN_ON_ONCE(dma_mapping_error(ndev->dev.parent, skb_dma_addr))) { + dev_kfree_skb_any(skb); + ndev->stats.tx_dropped++; + return NETDEV_TX_OK; + } cur_p->phys = cpu_to_be32(skb_dma_addr); ptr_to_txbd((void *)skb, cur_p);
@@ -897,7 +901,9 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) dma_unmap_single(ndev->dev.parent, be32_to_cpu(cur_p->phys), skb_headlen(skb), DMA_TO_DEVICE); - return NETDEV_TX_BUSY; + dev_kfree_skb_any(skb); + ndev->stats.tx_dropped++; + return NETDEV_TX_OK; } cur_p->phys = cpu_to_be32(skb_dma_addr); cur_p->len = cpu_to_be32(skb_frag_size(frag)); @@ -920,6 +926,17 @@ temac_start_xmit(struct sk_buff *skb, struct net_device *ndev) return NETDEV_TX_OK; }
+static int ll_temac_recv_buffers_available(struct temac_local *lp) +{ + int available; + + if (!lp->rx_skb[lp->rx_bd_ci]) + return 0; + available = 1 + lp->rx_bd_tail - lp->rx_bd_ci; + if (available <= 0) + available += RX_BD_NUM; + return available; +}
static void ll_temac_recv(struct net_device *ndev) { @@ -990,6 +1007,18 @@ static void ll_temac_recv(struct net_device *ndev) lp->rx_bd_ci = 0; } while (rx_bd != lp->rx_bd_tail);
+ /* DMA operations will halt when the last buffer descriptor is + * processed (ie. the one pointed to by RX_TAILDESC_PTR). + * When that happens, no more interrupt events will be + * generated. No IRQ_COAL or IRQ_DLY, and not even an + * IRQ_ERR. To avoid stalling, we schedule a delayed work + * when there is a potential risk of that happening. The work + * will call this function, and thus re-schedule itself until + * enough buffers are available again. + */ + if (ll_temac_recv_buffers_available(lp) < lp->coalesce_count_rx) + schedule_delayed_work(&lp->restart_work, HZ / 1000); + /* Allocate new buffers for those buffer descriptors that were * passed to network stack. Note that GFP_ATOMIC allocations * can fail (e.g. when a larger burst of GFP_ATOMIC @@ -1045,6 +1074,18 @@ static void ll_temac_recv(struct net_device *ndev) spin_unlock_irqrestore(&lp->rx_lock, flags); }
+/* Function scheduled to ensure a restart in case of DMA halt + * condition caused by running out of buffer descriptors. + */ +static void ll_temac_restart_work_func(struct work_struct *work) +{ + struct temac_local *lp = container_of(work, struct temac_local, + restart_work.work); + struct net_device *ndev = lp->ndev; + + ll_temac_recv(ndev); +} + static irqreturn_t ll_temac_tx_irq(int irq, void *_ndev) { struct net_device *ndev = _ndev; @@ -1137,6 +1178,8 @@ static int temac_stop(struct net_device *ndev)
dev_dbg(&ndev->dev, "temac_close()\n");
+ cancel_delayed_work_sync(&lp->restart_work); + free_irq(lp->tx_irq, ndev); free_irq(lp->rx_irq, ndev);
@@ -1269,6 +1312,7 @@ static int temac_probe(struct platform_device *pdev) lp->dev = &pdev->dev; lp->options = XTE_OPTION_DEFAULTS; spin_lock_init(&lp->rx_lock); + INIT_DELAYED_WORK(&lp->restart_work, ll_temac_restart_work_func);
/* Setup mutex for synchronization of indirect register access */ if (pdata) { @@ -1375,6 +1419,7 @@ static int temac_probe(struct platform_device *pdev) */ lp->tx_chnl_ctrl = 0x10220000; lp->rx_chnl_ctrl = 0xff070000; + lp->coalesce_count_rx = 0x07;
/* Finished with the DMA node; drop the reference */ of_node_put(dma_np); @@ -1406,11 +1451,14 @@ static int temac_probe(struct platform_device *pdev) (pdata->tx_irq_count << 16); else lp->tx_chnl_ctrl = 0x10220000; - if (pdata->rx_irq_timeout || pdata->rx_irq_count) + if (pdata->rx_irq_timeout || pdata->rx_irq_count) { lp->rx_chnl_ctrl = (pdata->rx_irq_timeout << 24) | (pdata->rx_irq_count << 16); - else + lp->coalesce_count_rx = pdata->rx_irq_count; + } else { lp->rx_chnl_ctrl = 0xff070000; + lp->coalesce_count_rx = 0x07; + } }
/* Error handle returned DMA RX and TX interrupts */
From: Ming Lei ming.lei@redhat.com
[ Upstream commit 01e99aeca3979600302913cef3f89076786f32c8 ]
For some reason, device may be in one situation which can't handle FS request, so STS_RESOURCE is always returned and the FS request will be added to hctx->dispatch. However passthrough request may be required at that time for fixing the problem. If passthrough request is added to scheduler queue, there isn't any chance for blk-mq to dispatch it given we prioritize requests in hctx->dispatch. Then the FS IO request may never be completed, and IO hang is caused.
So passthrough request has to be added to hctx->dispatch directly for fixing the IO hang.
Fix this issue by inserting passthrough request into hctx->dispatch directly together withing adding FS request to the tail of hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS request to tail of hctx->dispatch at default, see blk_mq_request_bypass_insert().
Then it becomes consistent with original legacy IO request path, in which passthrough request is always added to q->queue_head.
Cc: Dongli Zhang dongli.zhang@oracle.com Cc: Christoph Hellwig hch@infradead.org Cc: Ewan D. Milne emilne@redhat.com Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- block/blk-flush.c | 2 +- block/blk-mq-sched.c | 22 +++++++++++++++------- block/blk-mq.c | 18 +++++++++++------- block/blk-mq.h | 3 ++- 4 files changed, 29 insertions(+), 16 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c index 3f977c517960e..5cc775bdb06ac 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -412,7 +412,7 @@ void blk_insert_flush(struct request *rq) */ if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { - blk_mq_request_bypass_insert(rq, false); + blk_mq_request_bypass_insert(rq, false, false); return; }
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index ca22afd47b3dc..856356b1619e8 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -361,13 +361,19 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, bool has_sched, struct request *rq) { - /* dispatch flush rq directly */ - if (rq->rq_flags & RQF_FLUSH_SEQ) { - spin_lock(&hctx->lock); - list_add(&rq->queuelist, &hctx->dispatch); - spin_unlock(&hctx->lock); + /* + * dispatch flush and passthrough rq directly + * + * passthrough request has to be added to hctx->dispatch directly. + * For some reason, device may be in one situation which can't + * handle FS request, so STS_RESOURCE is always returned and the + * FS request will be added to hctx->dispatch. However passthrough + * request may be required at that time for fixing the problem. If + * passthrough request is added to scheduler queue, there isn't any + * chance to dispatch it given we prioritize requests in hctx->dispatch. + */ + if ((rq->rq_flags & RQF_FLUSH_SEQ) || blk_rq_is_passthrough(rq)) return true; - }
if (has_sched) rq->rq_flags |= RQF_SORTED; @@ -391,8 +397,10 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
WARN_ON(e && (rq->tag != -1));
- if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) + if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) { + blk_mq_request_bypass_insert(rq, at_head, false); goto run; + }
if (e && e->type->ops.insert_requests) { LIST_HEAD(list); diff --git a/block/blk-mq.c b/block/blk-mq.c index 323c9cb28066b..329df7986bf60 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -727,7 +727,7 @@ static void blk_mq_requeue_work(struct work_struct *work) * merge. */ if (rq->rq_flags & RQF_DONTPREP) - blk_mq_request_bypass_insert(rq, false); + blk_mq_request_bypass_insert(rq, false, false); else blk_mq_sched_insert_request(rq, true, false, false); } @@ -1278,7 +1278,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, q->mq_ops->commit_rqs(hctx);
spin_lock(&hctx->lock); - list_splice_init(list, &hctx->dispatch); + list_splice_tail_init(list, &hctx->dispatch); spin_unlock(&hctx->lock);
/* @@ -1629,12 +1629,16 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, * Should only be used carefully, when the caller knows we want to * bypass a potential IO scheduler on the target device. */ -void blk_mq_request_bypass_insert(struct request *rq, bool run_queue) +void blk_mq_request_bypass_insert(struct request *rq, bool at_head, + bool run_queue) { struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
spin_lock(&hctx->lock); - list_add_tail(&rq->queuelist, &hctx->dispatch); + if (at_head) + list_add(&rq->queuelist, &hctx->dispatch); + else + list_add_tail(&rq->queuelist, &hctx->dispatch); spin_unlock(&hctx->lock);
if (run_queue) @@ -1824,7 +1828,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (bypass_insert) return BLK_STS_RESOURCE;
- blk_mq_request_bypass_insert(rq, run_queue); + blk_mq_request_bypass_insert(rq, false, run_queue); return BLK_STS_OK; }
@@ -1840,7 +1844,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) - blk_mq_request_bypass_insert(rq, true); + blk_mq_request_bypass_insert(rq, false, true); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret);
@@ -1874,7 +1878,7 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx, if (ret != BLK_STS_OK) { if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { - blk_mq_request_bypass_insert(rq, + blk_mq_request_bypass_insert(rq, false, list_empty(list)); break; } diff --git a/block/blk-mq.h b/block/blk-mq.h index eaaca8fc1c287..c0fa34378eb2f 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -66,7 +66,8 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, */ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, bool at_head); -void blk_mq_request_bypass_insert(struct request *rq, bool run_queue); +void blk_mq_request_bypass_insert(struct request *rq, bool at_head, + bool run_queue); void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, struct list_head *list);
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
[ Upstream commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 ]
After making ext4 support iopoll method: let ext4_file_operations's iopoll method be iomap_dio_iopoll(), we found fio can easily hang in fio_ioring_getevents() with below fio job: rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
There are two issues that results in this hang, one reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter to get completed events, it relies on kernel io_sq_thread to poll for completed events.
Another reason is that there is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, variable 'inflight' will record the number of submitted reqs, then io_sq_thread will poll for reqs which have been added to poll_list. But note, if some previous reqs have been punted to io worker, these reqs will won't be in poll_list timely. io_sq_thread() will only poll for a part of previous submitted reqs, and then find poll_list is empty, reset variable 'inflight' to be zero. If app just waits these deferred reqs and does not wake up io_sq_thread again, then hang happens.
For app that entirely relies on io_sq_thread to poll completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding new element to poll_list, and when io_sq_thread prepares to sleep, check whether poll_list is empty again, if not empty, continue to poll.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- fs/io_uring.c | 59 +++++++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 678c62782ba3b..95df7026ac5aa 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1435,6 +1435,10 @@ static void io_iopoll_req_issued(struct io_kiocb *req) list_add(&req->list, &ctx->poll_list); else list_add_tail(&req->list, &ctx->poll_list); + + if ((ctx->flags & IORING_SETUP_SQPOLL) && + wq_has_sleeper(&ctx->sqo_wait)) + wake_up(&ctx->sqo_wait); }
static void io_file_put(struct io_submit_state *state) @@ -3847,9 +3851,8 @@ static int io_sq_thread(void *data) const struct cred *old_cred; mm_segment_t old_fs; DEFINE_WAIT(wait); - unsigned inflight; unsigned long timeout; - int ret; + int ret = 0;
complete(&ctx->completions[1]);
@@ -3857,39 +3860,19 @@ static int io_sq_thread(void *data) set_fs(USER_DS); old_cred = override_creds(ctx->creds);
- ret = timeout = inflight = 0; + timeout = jiffies + ctx->sq_thread_idle; while (!kthread_should_park()) { unsigned int to_submit;
- if (inflight) { + if (!list_empty(&ctx->poll_list)) { unsigned nr_events = 0;
- if (ctx->flags & IORING_SETUP_IOPOLL) { - /* - * inflight is the count of the maximum possible - * entries we submitted, but it can be smaller - * if we dropped some of them. If we don't have - * poll entries available, then we know that we - * have nothing left to poll for. Reset the - * inflight count to zero in that case. - */ - mutex_lock(&ctx->uring_lock); - if (!list_empty(&ctx->poll_list)) - io_iopoll_getevents(ctx, &nr_events, 0); - else - inflight = 0; - mutex_unlock(&ctx->uring_lock); - } else { - /* - * Normal IO, just pretend everything completed. - * We don't have to poll completions for that. - */ - nr_events = inflight; - } - - inflight -= nr_events; - if (!inflight) + mutex_lock(&ctx->uring_lock); + if (!list_empty(&ctx->poll_list)) + io_iopoll_getevents(ctx, &nr_events, 0); + else timeout = jiffies + ctx->sq_thread_idle; + mutex_unlock(&ctx->uring_lock); }
to_submit = io_sqring_entries(ctx); @@ -3918,7 +3901,7 @@ static int io_sq_thread(void *data) * more IO, we should wait for the application to * reap events and wake us up. */ - if (inflight || + if (!list_empty(&ctx->poll_list) || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { cond_resched(); @@ -3928,6 +3911,19 @@ static int io_sq_thread(void *data) prepare_to_wait(&ctx->sqo_wait, &wait, TASK_INTERRUPTIBLE);
+ /* + * While doing polled IO, before going to sleep, we need + * to check if there are new reqs added to poll_list, it + * is because reqs may have been punted to io worker and + * will be added to poll_list later, hence check the + * poll_list again. + */ + if ((ctx->flags & IORING_SETUP_IOPOLL) && + !list_empty_careful(&ctx->poll_list)) { + finish_wait(&ctx->sqo_wait, &wait); + continue; + } + /* Tell userspace we may need a wakeup call */ ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; /* make sure to read SQ tail after writing flags */ @@ -3956,8 +3952,7 @@ static int io_sq_thread(void *data) mutex_lock(&ctx->uring_lock); ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true); mutex_unlock(&ctx->uring_lock); - if (ret > 0) - inflight += ret; + timeout = jiffies + ctx->sq_thread_idle; }
set_fs(old_fs);
From: Monk Liu Monk.Liu@amd.com
[ Upstream commit 4829f89855f1d3a3d8014e74cceab51b421503db ]
fix system memory leak
v2: fix coding style
Signed-off-by: Monk Liu Monk.Liu@amd.com Reviewed-by: Hawking Zhang Hawking.Zhang@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/powerplay/smu_v11_0.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c index 8b13d18c6414c..e4149e6b68b39 100644 --- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c +++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c @@ -948,8 +948,12 @@ int smu_v11_0_init_max_sustainable_clocks(struct smu_context *smu) struct smu_11_0_max_sustainable_clocks *max_sustainable_clocks; int ret = 0;
- max_sustainable_clocks = kzalloc(sizeof(struct smu_11_0_max_sustainable_clocks), + if (!smu->smu_table.max_sustainable_clocks) + max_sustainable_clocks = kzalloc(sizeof(struct smu_11_0_max_sustainable_clocks), GFP_KERNEL); + else + max_sustainable_clocks = smu->smu_table.max_sustainable_clocks; + smu->smu_table.max_sustainable_clocks = (void *)max_sustainable_clocks;
max_sustainable_clocks->uclock = smu->smu_table.boot_values.uclk / 100;
From: Jens Axboe axboe@kernel.dk
[ Upstream commit 2a44f46781617c5040372b59da33553a02b1f46d ]
If work completes inline, then we should pick up a dependent link item in __io_queue_sqe() as well. If we don't do so, we're forced to go async with that item, which is suboptimal.
This also fixes an issue with io_put_req_find_next(), which always looks up the next work item. That should only be done if we're dropping the last reference to the request, to prevent multiple lookups of the same work item.
Outside of being a fix, this also enables a good cleanup series for 5.7, where we never have to pass 'nxt' around or into the work handlers.
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- fs/io_uring.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 95df7026ac5aa..39b18ab928210 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1099,10 +1099,10 @@ static void io_free_req(struct io_kiocb *req) __attribute__((nonnull)) static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) { - io_req_find_next(req, nxtptr); - - if (refcount_dec_and_test(&req->refs)) + if (refcount_dec_and_test(&req->refs)) { + io_req_find_next(req, nxtptr); __io_free_req(req); + } }
static void io_put_req(struct io_kiocb *req) @@ -3559,7 +3559,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
err: /* drop submission reference */ - io_put_req(req); + io_put_req_find_next(req, &nxt);
if (linked_timeout) { if (!ret)
From: Masahiro Yamada masahiroy@kernel.org
[ Upstream commit 7a04960560640ac5b0b89461f7757322b57d0c7a ]
This if_change_rule is not working properly; it cannot detect any command line change.
The reason is because cmd-check in scripts/Kbuild.include compares $(cmd_$@) and $(cmd_$1), but cmd_dtc_dt_yaml does not exist here.
For if_change_rule to work properly, the stem part of cmd_* and rule_* must match. Because this cmd_and_fixdep invokes cmd_dtc, this rule must be named rule_dtc.
Fixes: 4f0e3a57d6eb ("kbuild: Add support for DT binding schema checks") Signed-off-by: Masahiro Yamada masahiroy@kernel.org Acked-by: Rob Herring robh@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- scripts/Makefile.lib | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib index 3fa32f83b2d74..a66fc0acad1e4 100644 --- a/scripts/Makefile.lib +++ b/scripts/Makefile.lib @@ -291,13 +291,13 @@ DT_TMP_SCHEMA := $(objtree)/$(DT_BINDING_DIR)/processed-schema.yaml quiet_cmd_dtb_check = CHECK $@ cmd_dtb_check = $(DT_CHECKER) -u $(srctree)/$(DT_BINDING_DIR) -p $(DT_TMP_SCHEMA) $@ ;
-define rule_dtc_dt_yaml +define rule_dtc $(call cmd_and_fixdep,dtc,yaml) $(call cmd,dtb_check) endef
$(obj)/%.dt.yaml: $(src)/%.dts $(DTC) $(DT_TMP_SCHEMA) FORCE - $(call if_changed_rule,dtc_dt_yaml) + $(call if_changed_rule,dtc)
dtc-tmp = $(subst $(comma),_,$(dot-target).dts.tmp)
From: Masahiro Yamada masahiroy@kernel.org
[ Upstream commit 964a596db8db8c77c9903dd05655696696e6b3ad ]
The dtbs_check should be a phony target, but currently it is not specified so.
'make dtbs_check' works even if a file named 'dtbs_check' exists because it depends on another phony target, scripts_dtc, but we should not rely on it.
Add dtbs_check to PHONY.
Signed-off-by: Masahiro Yamada masahiroy@kernel.org Acked-by: Rob Herring robh@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Makefile b/Makefile index 0f64b92fa39a6..f86ebe1dd2bd0 100644 --- a/Makefile +++ b/Makefile @@ -1239,7 +1239,7 @@ ifneq ($(dtstree),) %.dtb: include/config/kernel.release scripts_dtc $(Q)$(MAKE) $(build)=$(dtstree) $(dtstree)/$@
-PHONY += dtbs dtbs_install dt_binding_check +PHONY += dtbs dtbs_install dtbs_check dt_binding_check dtbs dtbs_check: include/config/kernel.release scripts_dtc $(Q)$(MAKE) $(build)=$(dtstree)
From: Masahiro Yamada masahiroy@kernel.org
[ Upstream commit c473a8d03ea8e03ca9d649b0b6d72fbcf6212c05 ]
The dt_binding_check is added to PHONY, but it is invisible when $(dtstree) is empty. So, it is not specified as phony for ARCH=x86 etc.
Add it to PHONY outside the ifneq ... endif block.
Signed-off-by: Masahiro Yamada masahiroy@kernel.org Acked-by: Rob Herring robh@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- Makefile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/Makefile b/Makefile index f86ebe1dd2bd0..4dc2dd35e9e3e 100644 --- a/Makefile +++ b/Makefile @@ -1239,7 +1239,7 @@ ifneq ($(dtstree),) %.dtb: include/config/kernel.release scripts_dtc $(Q)$(MAKE) $(build)=$(dtstree) $(dtstree)/$@
-PHONY += dtbs dtbs_install dtbs_check dt_binding_check +PHONY += dtbs dtbs_install dtbs_check dtbs dtbs_check: include/config/kernel.release scripts_dtc $(Q)$(MAKE) $(build)=$(dtstree)
@@ -1259,6 +1259,7 @@ PHONY += scripts_dtc scripts_dtc: scripts_basic $(Q)$(MAKE) $(build)=scripts/dtc
+PHONY += dt_binding_check dt_binding_check: scripts_dtc $(Q)$(MAKE) $(build)=Documentation/devicetree/bindings
linux-stable-mirror@lists.linaro.org