On Mon, Nov 5, 2018, at 7:10 AM, Milosz Wasilewski wrote:
On Mon, 5 Nov 2018 at 02:29, Rafael David Tinoco rafael.tinoco@linaro.org wrote:
On Sun, Nov 4, 2018, at 10:46 PM, Linaro QA wrote:
Summary
kernel: 4.14.79 git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git branch: linux-4.14.y git commit: 50961e4888a1d53544ac4ea6f185fc27ee4fee4f git describe: v4.14.79 Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.79
Regressions (compared to build v4.14.78-144-gb825fd9fbad5)
hi6220-hikey - arm64: libhugetlbfs: * mmap-cow-199-200-2M-64
This test creates a mmap for memory backed by hugepages using the file descriptor provided by libhugetlbfs (hugetlbfs_unlined_fd). It creates several childs that will mmap() this same region, mapped by parent, using MAP_PRIVATE flag. This flat will make child virtual memory pages COW. Whenever allocated memory is touched by a child, it has to be duplicated and not to touch parents memory.
This is the first time this tests fails for v4.14. Unfortunately, without stdout for the test, it is impossible to know why the thread has failed - if it failed because the lack of resources or because a child has touched parent's allocated memory - all we got was:
mmap-cow 199 200 (2M: 64): FAIL Thread 68 (pid=3536) failed
And this is the result of parent thread checking
if (WEXITSTATUS(status) != 0)
With that, checking why the thread could have failed:
if (mmap_file(fd, &addr, size, MAP_PRIVATE)) CHILD_FAIL(thread, "mmap() failed: %s", strerror(errno));
meaning it could NOT mmap the file (lack of resources ?)
OR
if (msync(addr, size, MS_SYNC)) CHILD_FAIL(thread, "msync() failed: %s", strerror(errno));
meaning it could NOT flush changes made to child's pages (lack of resources -> to allocated new pages and sync contents).
OR
/* Verify the pattern */ for (i = 0; i < size; i++) if (addr[i] != pattern) CHILD_FAIL(thread, "Corruption at %p; " "Got %c, Expected %c", &addr[i], addr[i], pattern);
meaning child has corrupted parent's memory, when it should not.
@Naresh,
Is it possible to get the output for those tests ? Based on this:
https://lkft.validation.linaro.org/scheduler/job/496395#L1490
libhugetlbfs isn't very chatty. The only stdout from this test is: mmap-cow 199 200 (2M: 64): FAIL Thread 68 (pid=3536) failed set shmmax limit to 419430400
https://lkft.validation.linaro.org/scheduler/job/496395#L1605
I'm not even sure the 'failed set shmmax' line comes from this test.
milosz
I wonder if I can get the output for the tests, checking which message it has. I bet it was something related to lack of resources, specially if hugepages on hikey was under pressure... like we've seen in other tests. But, since this is the 1st time it happens for this test, I thought investigating further would be beneficial.
Best, Rafael
Nope, that output is likely unrelated...
BUT, UNFORTUNATELY, I have reproduced a real bad issue running the same kernel and latest libhugetlbfs tests:
inaddy@hikey:~$ sudo hugeadm --pool-pages-min 2M:200
inaddy@hikey:~$ sudo hugeadm --pool-list Size Minimum Current Maximum Default 2097152 200 200 200 *
root@hikey:obj64$ while true; do ./mmap-cow 199 200 ; RET=`echo $?`; [ $RET -ne 0 ] && break ; done
Thread 95 (pid=12149): Pattern verified Thread 106 (pid=12160): Pattern verified Thread 178 (pid=12232): Pattern verified Thread 156 (pid=12210): Pattern verified Thread 159 (pid=12213): Pattern verified Thread 63 (pid=12117): Pattern verified Thread 184 (pid=12238): Pattern verified Thread 189 (pid=12243): Pattern verified Thread 198 (pid=12252): Pattern verified Thread 115 (pid=12169): Pattern verified Thread 145 (pid=12199) FAIL: Corruption at 0x4000002e24c0; Got d, Expected � Thread 117 (pid=12171): Pattern verified Thread 195 (pid=12249): Pattern verified Thread 52 (pid=12106): Pattern verified Thread 31 (pid=12085): Pattern verified Thread 169 (pid=12223): Pattern verified
Not sure I'm happy reproducing this...
but looks like we might have hit a bug for 4.14 on hikey (or even on arm64). It took quite sometime for me to reproduce this AND I was able to make it happen just once... which means we might have a race causing COW code to allow a PRIVATE only page to mess with the original page when mapping it.. will have to open a bug for this... and can't say its related to latest 4.14 only since this is intermittent and really hard to cause...
I have also caught this:
Nov 5 12:02:54 hikey kernel: [ 22.148194] Unbalanced enable for IRQ 44 Nov 5 12:02:54 hikey kernel: [ 22.152193] ------------[ cut here ]------------ Nov 5 12:02:54 hikey kernel: [ 22.156872] WARNING: CPU: 2 PID: 509 at /home/inaddy/work/sources/linux/stable/stable-linux-4.14.y/kernel/irq/manage.c:525 __enable_irq+0x78/0x80 Nov 5 12:02:54 hikey kernel: [ 22.169931] Modules linked in: ebtable_filter devlink ebtables ip6table_filter ip6_tables iptable_filter cpufreq_powersave cpufreq_conservative cpufreq_userspace cpufreq_ondemand fuse arc4 bridge stp llc wl18xx wlcore nls_ascii nls_cp437 mac80211 vfat fat cfg80211 snd_soc_hdmi_codec rfkill cdc_ether usbnet r8152 mii sd_mod sg aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul aes_arm64 sha2_ce sha256_arm64 adv7511 sha1_ce kirin_drm hi6210_i2s drm_kms_helper snd_soc_core wlcore_sdio snd_pcm_dmaengine snd_pcm snd_timer snd dw_drm_dsi soundcore k3dma efi_pstore hisi_thermal drm efivars cpufreq_dt leds_gpio efivarfs ip_tables x_tables autofs4 uas usb_storage btrfs xor zstd_decompress zstd_compress xxhash raid6_pq dwc2 clk_hi655x hi655x_regulator udc_core usbcore dw_mmc_k3 phy_hi6220_usb dw_mmc_pltfm Nov 5 12:02:54 hikey kernel: [ 22.242048] dw_mmc i2c_designware_platform i2c_designware_core fixed hi655x_pmic Nov 5 12:02:54 hikey kernel: [ 22.249606] CPU: 2 PID: 509 Comm: kworker/2:2 Not tainted 4.14.79 #1 Nov 5 12:02:54 hikey kernel: [ 22.255975] Hardware name: HiKey Development Board (DT) Nov 5 12:02:54 hikey kernel: [ 22.261248] Workqueue: events_freezable thermal_zone_device_check Nov 5 12:02:54 hikey kernel: [ 22.267368] task: ffff8000616e0e00 task.stack: ffff00000b5f0000 Nov 5 12:02:54 hikey kernel: [ 22.273312] PC is at __enable_irq+0x78/0x80 Nov 5 12:02:54 hikey kernel: [ 22.277516] LR is at __enable_irq+0x78/0x80 Nov 5 12:02:54 hikey kernel: [ 22.281718] pc : [<ffff00000813e010>] lr : [<ffff00000813e010>] pstate: 000001c5 Nov 5 12:02:54 hikey kernel: [ 22.289129] sp : ffff00000b5f3c80 Nov 5 12:02:54 hikey kernel: [ 22.292457] x29: ffff00000b5f3c80 x28: 0000000000000000 Nov 5 12:02:54 hikey kernel: [ 22.297804] x27: ffff80005c139e38 x26: ffff000008a71870 Nov 5 12:02:54 hikey kernel: [ 22.303148] x25: 0000000000000000 x24: 0000000000000002 Nov 5 12:02:54 hikey kernel: [ 22.308492] x23: ffff00000b5f3d9c x22: ffff80005d565e88 Nov 5 12:02:54 hikey kernel: [ 22.313836] x21: 000000000000f980 x20: 000000000000002c Nov 5 12:02:54 hikey kernel: [ 22.319181] x19: ffff800061726000 x18: 0000000000000010 Nov 5 12:02:54 hikey kernel: [ 22.324524] x17: 0000000000000000 x16: 0000000000000000 Nov 5 12:02:54 hikey kernel: [ 22.329868] x15: ffffffffffffffff x14: ffff000009269c08 Nov 5 12:02:54 hikey kernel: [ 22.335213] x13: ffff00008940678f x12: ffff000009406797 Nov 5 12:02:54 hikey kernel: [ 22.340558] x11: ffff000009290000 x10: ffff00000b5f3980 Nov 5 12:02:54 hikey kernel: [ 22.345902] x9 : 00000000ffffffd0 x8 : ffff00000862c298 Nov 5 12:02:54 hikey kernel: [ 22.351246] x7 : 6c62616e65206465 x6 : 00000000000001b2 Nov 5 12:02:54 hikey kernel: [ 22.356589] x5 : 0000000000000000 x4 : 0000000000000000 Nov 5 12:02:54 hikey kernel: [ 22.361931] x3 : 0000000000000000 x2 : ffff800063e824c8 Nov 5 12:02:54 hikey kernel: [ 22.367275] x1 : 000080005af95000 x0 : 000000000000001c Nov 5 12:02:54 hikey kernel: [ 22.372618] Call trace: Nov 5 12:02:54 hikey kernel: [ 22.375088] Exception stack(0xffff00000b5f3b40 to 0xffff00000b5f3c80) Nov 5 12:02:54 hikey kernel: [ 22.381560] 3b40: 000000000000001c 000080005af95000 ffff800063e824c8 0000000000000000 Nov 5 12:02:54 hikey kernel: [ 22.389417] 3b60: 0000000000000000 0000000000000000 00000000000001b2 6c62616e65206465 Nov 5 12:02:54 hikey kernel: [ 22.397276] 3b80: ffff00000862c298 00000000ffffffd0 ffff00000b5f3980 ffff000009290000 Nov 5 12:02:54 hikey kernel: [ 22.405136] 3ba0: ffff000009406797 ffff00008940678f ffff000009269c08 ffffffffffffffff Nov 5 12:02:54 hikey kernel: [ 22.412994] 3bc0: 0000000000000000 0000000000000000 0000000000000010 ffff800061726000 Nov 5 12:02:54 hikey kernel: [ 22.420852] 3be0: 000000000000002c 000000000000f980 ffff80005d565e88 ffff00000b5f3d9c Nov 5 12:02:54 hikey kernel: [ 22.428710] 3c00: 0000000000000002 0000000000000000 ffff000008a71870 ffff80005c139e38 Nov 5 12:02:54 hikey kernel: [ 22.436569] 3c20: 0000000000000000 ffff00000b5f3c80 ffff00000813e010 ffff00000b5f3c80 Nov 5 12:02:54 hikey kernel: [ 22.444426] 3c40: ffff00000813e010 00000000000001c5 0000000000000000 0000000000000000 Nov 5 12:02:54 hikey kernel: [ 22.452286] 3c60: ffffffffffffffff ffff800061800618 ffff00000b5f3c80 ffff00000813e010 Nov 5 12:02:54 hikey kernel: [ 22.460144] [<ffff00000813e010>] __enable_irq+0x78/0x80 Nov 5 12:02:54 hikey kernel: [ 22.465394] [<ffff00000813e058>] enable_irq+0x40/0x78 Nov 5 12:02:54 hikey kernel: [ 22.470493] [<ffff000000e228a8>] hisi_thermal_get_temp+0x1b0/0x1d8 [hisi_thermal] Nov 5 12:02:54 hikey kernel: [ 22.478008] [<ffff0000087121a8>] of_thermal_get_temp+0x38/0x50 Nov 5 12:02:54 hikey kernel: [ 22.483869] [<ffff000008711790>] thermal_zone_get_temp+0x58/0x80 Nov 5 12:02:54 hikey kernel: [ 22.489903] [<ffff00000870e7bc>] thermal_zone_device_update.part.4+0x2c/0x1a8 Nov 5 12:02:54 hikey kernel: [ 22.497066] [<ffff00000870e9c8>] thermal_zone_device_check+0x40/0x50 Nov 5 12:02:54 hikey kernel: [ 22.503457] [<ffff0000080f1674>] process_one_work+0x19c/0x3d0 Nov 5 12:02:54 hikey kernel: [ 22.509236] [<ffff0000080f18f4>] worker_thread+0x4c/0x428 Nov 5 12:02:54 hikey kernel: [ 22.514664] [<ffff0000080f84fc>] kthread+0x134/0x138 Nov 5 12:02:54 hikey kernel: [ 22.519659] [<ffff000008085154>] ret_from_fork+0x10/0x1c Nov 5 12:02:54 hikey kernel: [ 22.524988] ---[ end trace 328d4bb2d9b066a0 ]---
Which might be unrelated and a second issue to be also investigated.