A silent data corruption was introduced in v4.10-rc1 with commit 72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7 with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects users of O_DIRECT, in our case a KVM virtual machine with drives which use qemu's "cache=none" option.
The other 2 commits have been accepted in 4.14, but 2 are missing, ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796542
Please consider including them in the next release.
Thanks,
Jack Wang @ 1 & 1 IONOS Cloud GmbH
Christoph Hellwig (1):
  block: add a lower-level bio_add_page interface

Martin Wilck (1):
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

 block/bio.c         | 131 ++++++++++++++++++++++++++++++++------------
 include/linux/bio.h |   9 +++
 2 files changed, 104 insertions(+), 36 deletions(-)
From: Christoph Hellwig <hch@lst.de>
commit 0aa69fd32a5f766e997ca8ab4723c5a1146efa8b upstream
For the upcoming removal of buffer heads in XFS we need to keep track of the number of outstanding writeback requests per page. For this we need to know if bio_add_page merged a region with the previous bvec or not. Instead of adding additional arguments this refactors bio_add_page to be implemented using three lower level helpers which users like XFS can use directly if they care about the merge decisions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[jwang: cherry pick to 4.14, required for next patch to build]
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/bio.c         | 96 +++++++++++++++++++++++++++++----------------
 include/linux/bio.h |  9 +++++
 2 files changed, 72 insertions(+), 33 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index d01ab919b313..c1386ce2c014 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -773,7 +773,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 		return 0;
 	}

-	if (bio->bi_vcnt >= bio->bi_max_vecs)
+	if (bio_full(bio))
 		return 0;

 	/*
@@ -821,52 +821,82 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 EXPORT_SYMBOL(bio_add_pc_page);

 /**
- * bio_add_page - attempt to add page to bio
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
+ * __bio_try_merge_page - try appending data to an existing bvec.
+ * @bio: destination bio
+ * @page: page to add
+ * @len: length of the data to add
+ * @off: offset of the data in @page
  *
- * Attempt to add a page to the bio_vec maplist. This will only fail
- * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
+ * Try to add the data at @page + @off to the last bvec of @bio. This is a
+ * a useful optimisation for file systems with a block size smaller than the
+ * page size.
+ *
+ * Return %true on success or %false on failure.
  */
-int bio_add_page(struct bio *bio, struct page *page,
-		 unsigned int len, unsigned int offset)
+bool __bio_try_merge_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off)
 {
-	struct bio_vec *bv;
-
-	/*
-	 * cloned bio must not modify vec list
-	 */
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
-		return 0;
+		return false;

-	/*
-	 * For filesystems with a blocksize smaller than the pagesize
-	 * we will often be called with the same page as last time and
-	 * a consecutive offset. Optimize this special case.
-	 */
 	if (bio->bi_vcnt > 0) {
-		bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];

-		if (page == bv->bv_page &&
-		    offset == bv->bv_offset + bv->bv_len) {
+		if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
 			bv->bv_len += len;
-			goto done;
+			bio->bi_iter.bi_size += len;
+			return true;
 		}
 	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(__bio_try_merge_page);

-	if (bio->bi_vcnt >= bio->bi_max_vecs)
-		return 0;
+/**
+ * __bio_add_page - add page to a bio in a new segment
+ * @bio: destination bio
+ * @page: page to add
+ * @len: length of the data to add
+ * @off: offset of the data in @page
+ *
+ * Add the data at @page + @off to @bio as a new bvec. The caller must ensure
+ * that @bio has space for another bvec.
+ */
+void __bio_add_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off)
+{
+	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];

-	bv = &bio->bi_io_vec[bio->bi_vcnt];
-	bv->bv_page = page;
-	bv->bv_len = len;
-	bv->bv_offset = offset;
+	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+	WARN_ON_ONCE(bio_full(bio));
+
+	bv->bv_page = page;
+	bv->bv_offset = off;
+	bv->bv_len = len;

-	bio->bi_vcnt++;
-done:
 	bio->bi_iter.bi_size += len;
+	bio->bi_vcnt++;
+}
+EXPORT_SYMBOL_GPL(__bio_add_page);
+
+/**
+ * bio_add_page - attempt to add page to bio
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This will only fail
+ * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+		 unsigned int len, unsigned int offset)
+{
+	if (!__bio_try_merge_page(bio, page, len, offset)) {
+		if (bio_full(bio))
+			return 0;
+		__bio_add_page(bio, page, len, offset);
+	}
 	return len;
 }
 EXPORT_SYMBOL(bio_add_page);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d4b39caf081d..e260f000b9ac 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -123,6 +123,11 @@ static inline void *bio_data(struct bio *bio)
 	return NULL;
 }

+static inline bool bio_full(struct bio *bio)
+{
+	return bio->bi_vcnt >= bio->bi_max_vecs;
+}
+
 /*
  * will die
  */
@@ -459,6 +464,10 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+bool __bio_try_merge_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off);
+void __bio_add_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
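For readers who want to see the merge-or-append split from the patch above in isolation, here is a minimal userspace sketch. It is not kernel code: the `model_*` types and functions are hypothetical stand-ins for `struct bio`, `struct bio_vec`, `bio_full()`, `__bio_try_merge_page()`, `__bio_add_page()` and `bio_add_page()`, the fixed-size array and `assert()` replace the kernel's allocation and `WARN_ON_ONCE()`, and the cloned-bio check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: one contiguous chunk of a page. */
struct model_bvec {
	const void *page;	/* opaque page identity */
	unsigned int len;
	unsigned int offset;
};

#define MODEL_MAX_VECS 4

/* Hypothetical stand-in for struct bio; max_vecs must be <= MODEL_MAX_VECS. */
struct model_bio {
	struct model_bvec vecs[MODEL_MAX_VECS];
	unsigned short vcnt;		/* bvecs in use */
	unsigned short max_vecs;	/* capacity */
	unsigned int size;		/* total bytes, like bi_iter.bi_size */
};

static bool model_bio_full(const struct model_bio *bio)
{
	return bio->vcnt >= bio->max_vecs;
}

/* Extend the last bvec if the new data directly follows it in the same page. */
static bool model_try_merge_page(struct model_bio *bio, const void *page,
				 unsigned int len, unsigned int off)
{
	if (bio->vcnt > 0) {
		struct model_bvec *bv = &bio->vecs[bio->vcnt - 1];

		if (page == bv->page && off == bv->offset + bv->len) {
			bv->len += len;
			bio->size += len;
			return true;
		}
	}
	return false;
}

/* Append a new bvec; the caller must have checked model_bio_full(). */
static void model_add_page(struct model_bio *bio, const void *page,
			   unsigned int len, unsigned int off)
{
	struct model_bvec *bv;

	assert(!model_bio_full(bio));
	bv = &bio->vecs[bio->vcnt];
	bv->page = page;
	bv->offset = off;
	bv->len = len;
	bio->size += len;
	bio->vcnt++;
}

/* Merge if possible, otherwise append; returns len, or 0 if the bio is full. */
static unsigned int model_bio_add_page(struct model_bio *bio, const void *page,
				       unsigned int len, unsigned int off)
{
	if (!model_try_merge_page(bio, page, len, off)) {
		if (model_bio_full(bio))
			return 0;
		model_add_page(bio, page, len, off);
	}
	return len;
}
```

A caller that needs to know whether a merge happened can call `model_try_merge_page()` and `model_add_page()` itself, which is the use case the commit message gives for XFS. A caller that does not care keeps using the combined `model_bio_add_page()`.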
Why does this patch warrant a stable backport?
On Tue, Jun 25, 2019 at 4:25 PM Christoph Hellwig <hch@lst.de> wrote:
> Why does this patch warrant a stable backport?

[jwang: cherry pick to 4.14, required for next patch to build] :)
On Tue, Jun 25, 2019 at 04:27:44PM +0200, Jinpu Wang wrote:
> On Tue, Jun 25, 2019 at 4:25 PM Christoph Hellwig <hch@lst.de> wrote:
> > Why does this patch warrant a stable backport?
>
> [jwang: cherry pick to 4.14, required for next patch to build] :)

There was no next patch in my inbox..
On Tue, Jun 25, 2019 at 4:33 PM Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Jun 25, 2019 at 04:27:44PM +0200, Jinpu Wang wrote:
> > On Tue, Jun 25, 2019 at 4:25 PM Christoph Hellwig <hch@lst.de> wrote:
> > > Why does this patch warrant a stable backport?
> >
> > [jwang: cherry pick to 4.14, required for next patch to build] :)
>
> There was no next patch in my inbox..

Sorry, it's 17d51b10d777 ("block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs"). It has your Reviewed-by tag, so I thought git would also send it to you, but I checked and it did not.
link: https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/stable/0.git/com...
From: Martin Wilck <mwilck@suse.com>
commit 17d51b10d7773e4618bcac64648f30f12d4078fb upstream
bio_iov_iter_get_pages() currently only adds pages for the next non-zero segment from the iov_iter to the bio. That's suboptimal for callers, which typically try to pin as many pages as fit into the bio. This patch converts the current bio_iov_iter_get_pages() into a static helper, and introduces a new helper that allocates as many pages as
 1) fit into the bio,
 2) are present in the iov_iter,
 3) and can be pinned by MM.
Error is returned only if zero pages could be pinned. Because of 3), a zero return value doesn't necessarily mean all pages have been pinned. Callers that have to pin every page in the iov_iter must still call this function in a loop (this is currently the case).
This change matters most for __blkdev_direct_IO_simple(), which calls bio_iov_iter_get_pages() only once. If it obtains less pages than requested, it returns a "short write" or "short read", and __generic_file_write_iter() falls back to buffered writes, which may lead to data corruption.
Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[jwang: cherry-picked to 4.14]
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/bio.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index c1386ce2c014..1384f9790882 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -902,14 +902,16 @@ int bio_add_page(struct bio *bio, struct page *page,
 EXPORT_SYMBOL(bio_add_page);

 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins as many pages from *iter and appends them to @bio's bvec array. The
+ * Pins pages from *iter and appends them to @bio's bvec array. The
  * pages will have to be released using put_page() when done.
+ * For multi-segment *iter, this function only adds pages from the
+ * the next non-empty segment of the iov iterator.
  */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt, idx;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
@@ -946,6 +948,33 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_advance(iter, size);
 	return 0;
 }
+
+/**
+ * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be mapped
+ *
+ * Pins pages from *iter and appends them to @bio's bvec array. The
+ * pages will have to be released using put_page() when done.
+ * The function tries, but does not guarantee, to pin as many pages as
+ * fit into the bio, or are requested in *iter, whatever is smaller.
+ * If MM encounters an error pinning the requested pages, it stops.
+ * Error is returned only if 0 pages could be pinned.
+ */
+int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short orig_vcnt = bio->bi_vcnt;
+
+	do {
+		int ret = __bio_iov_iter_get_pages(bio, iter);
+
+		if (unlikely(ret))
+			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
+
+	} while (iov_iter_count(iter) && !bio_full(bio));
+
+	return 0;
+}
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);

 struct submit_bio_ret {
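The return-value rule of the new wrapper (error only if zero pages were pinned, success on partial progress) can be tried out with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: every `pin_*` name is hypothetical, each iov segment is reduced to a page count, one pinned page is assumed to use one bvec, and a global "budget" stands in for MM refusing to pin more pages.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an iov_iter: remaining pages per segment. */
struct pin_iter {
	unsigned int *seg_pages;
	size_t nr_segs;
	size_t cur;		/* index of the current segment */
};

/* Hypothetical model of a bio: only the bvec counters matter here. */
struct pin_bio {
	unsigned short vcnt;
	unsigned short max_vecs;
};

/* Pages the model "MM" will still pin; 0 makes every pin attempt fail. */
static unsigned int pin_budget;

static unsigned int pin_iter_count(const struct pin_iter *it)
{
	unsigned int total = 0;

	for (size_t i = it->cur; i < it->nr_segs; i++)
		total += it->seg_pages[i];
	return total;
}

static bool pin_bio_full(const struct pin_bio *bio)
{
	return bio->vcnt >= bio->max_vecs;
}

/* Inner helper: pins pages from the next non-empty segment only. */
static int pin_one_segment(struct pin_bio *bio, struct pin_iter *it)
{
	unsigned int want, room;

	while (it->cur < it->nr_segs && it->seg_pages[it->cur] == 0)
		it->cur++;
	if (it->cur == it->nr_segs)
		return -EFAULT;	/* nothing left to pin */

	room = bio->max_vecs - bio->vcnt;
	want = it->seg_pages[it->cur];
	if (want > room)
		want = room;
	if (want > pin_budget)
		want = pin_budget;
	if (want == 0)
		return -EFAULT;	/* model choice: no progress is an error */

	pin_budget -= want;
	it->seg_pages[it->cur] -= want;
	bio->vcnt += want;
	return 0;
}

/* Outer loop, shaped like the new wrapper in the patch above. */
static int pin_all_segments(struct pin_bio *bio, struct pin_iter *it)
{
	unsigned short orig_vcnt = bio->vcnt;

	do {
		int ret = pin_one_segment(bio, it);

		/* Report an error only if nothing at all was pinned. */
		if (ret)
			return bio->vcnt > orig_vcnt ? 0 : ret;
	} while (pin_iter_count(it) && !pin_bio_full(bio));

	return 0;
}
```

In this model, a caller that must pin every page still has to check `pin_iter_count()` after a zero return and call again, which matches the caveat in the commit message that a zero return does not mean all pages were pinned.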
On Tue, Jun 25, 2019 at 04:17:23PM +0200, Jack Wang wrote:
> A silent data corruption was introduced in v4.10-rc1 with commit 72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7 with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects users of O_DIRECT, in our case a KVM virtual machine with drives which use qemu's "cache=none" option.
> The other 2 commits have been accepted in 4.14, but 2 are missing, ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796542
> Please consider including them in the next release.
I've ended up cherry picking these two into the 4.14 tree.
-- Thanks, Sasha
On Tue, Jun 25, 2019 at 9:50 PM Sasha Levin <sashal@kernel.org> wrote:
> On Tue, Jun 25, 2019 at 04:17:23PM +0200, Jack Wang wrote:
> > A silent data corruption was introduced in v4.10-rc1 with commit 72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7 with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects users of O_DIRECT, in our case a KVM virtual machine with drives which use qemu's "cache=none" option.
> > The other 2 commits have been accepted in 4.14, but 2 are missing, ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796542
> > Please consider including them in the next release.
>
> I've ended up cherry picking these two into the 4.14 tree.

Thanks Sasha!

> -- Thanks, Sasha

Jack