
Add support for ZeRO-2/3 and ZeRO-offload in fairscale #10354

Merged
merged 6 commits into master from zero_3_offload on Feb 25, 2021

Conversation

sgugger
Collaborator

@sgugger sgugger commented Feb 23, 2021

What does this PR do?

This PR adds support for the new FullyShardedDataParallel introduced in fairscale. See this PR for more details.

The PR slightly changes the behavior of the --sharded_ddp flag/training argument to support a list of options. You can still use the TrainingArguments class with sharded_ddp=True, but when launching a script, --sharded_ddp has to be replaced with --sharded_ddp simple. The --sharded_ddp flag was marked as an experimental API, so I think this breaking change is fine as long as it is properly documented.

Other supported values are: zero_dp_2, zero_dp_2 offload, zero_dp_3 and zero_dp_3 offload. To fully take advantage of zero_dp_3/zero_dp_3 offload, the model passed to the Trainer will need to have its internal layers wrapped inside FullyShardedDataParallel, but this is out of scope for this particular PR.

For all these new modes, the model simply needs to be wrapped inside FullyShardedDataParallel, but the optimizer needs to be created after the model wrapping (to get the parameter shards); a minimal sketch of this ordering follows the notes below.

Note that:

  • predict_with_generate does not work with this integration
  • cpu_offload does not work for now due to the bug mentioned in this issue. Once the issue is fixed, the option should work with the existing code.
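
A minimal sketch of that ordering constraint (not the Trainer code itself), assuming torch.distributed is already initialized by the launcher and fairscale >= 0.3 is installed; the Linear module and hyperparameters are placeholders:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FullyShardedDDP

# Assumes the process group is already set up (e.g. via torch.distributed.launch)
# and a GPU is available on each rank.

# 1. Wrap the model first: FullyShardedDataParallel shards the parameters across ranks.
#    torch.nn.Linear stands in for the real model here.
model = FullyShardedDDP(torch.nn.Linear(512, 512).cuda())

# 2. Only then create the optimizer, so it is built from this rank's parameter shards
#    rather than from the full, unsharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)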

One thing to think about further is that this integration breaks the usual convention that self.model is the original model (FullyShardedDataParallel consumes the model to use less memory).

from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim import OSS
from fairscale.optim.grad_scaler import ShardedGradScaler

if version.parse(fairscale.__version__) >= version.parse("0.3"):
    from fairscale.nn.data_parallel import FullyShardedDataParallel as FullyShardedDDP
Contributor

I think this may introduce confusion here. Should we stick to DP and not DDP to match the real names, i.e. FullyShardedDP and ShardedDP?

Perhaps change the original flag to reflect that as well? --sharded_dp?

Contributor
@stas00 stas00 Feb 23, 2021

OK, I made a request to rename those to match DDP here:
facebookresearch/fairscale#413 (comment)


Thanks Stas. I personally think the distinction between DDP and DP is not going to matter anymore. Even pytorch DDP itself is moving to remove the "device_ids" argument in the future so that there isn't support for single-process DP (as opposed to distributed/multi-process DP). Therefore, I think sticking with FSDP is fine within fairscale.

Contributor

Thank you for your follow up, @min-xu-ai

Use Sharded DDP training from `FairScale <https://github.com/facebookresearch/fairscale>`__ (in distributed
training only). This is an experimental feature.

Can take up to six values:
Contributor

Suggested change
- Can take up to six values:
+ Can be one of the following values:

  • clarifying that it's one of them
  • the total count is of no useful value to the user

- :obj:`"no"`: for no sharded DataParallelism (default behavior)
- :obj:`"simple"`: to use first instance of sharded DDP released by fairscale (:obj:`ShardedDDP`) similar
to ZeRO-2.
- :obj:`"zero_2"`: to use the second instance of sharded DPP released by fairscale (:obj:`FullyShardedDDP`)
Contributor

We are smashing concepts together a bit here. ZeRO is a big territory with many features; the 3 stages belong to the ZeRO-DP part of ZeRO, so ideally this should be zero_dp_(1|2|3) or zero_dp(1|2|3).

This is just a suggestion though, if you strongly feel having just the number is clear enough, that's OK too.

Contributor
@stas00 stas00 Feb 23, 2021

Oh, and that's why they call it DP and not DDP, because it's ZeRO-DP.

@stas00
Contributor

stas00 commented Feb 23, 2021

Other values supported are: zero2, zero2_offload, zero3 and zero3_offload. To fully take advantage of zero3/zero3_offload the model passed to the Trainer will need to have its internal layers wrapped inside the FullyShardedDataParallel, but this is out of scope for this particular PR.

Do you feel it's better to hardcode these combinations and not have a more flexible approach of:

--sharded_ddp "zero2;offload;future_option"

or

--sharded_ddp "zero2 offload future_option"

which would enable adding new features in the future, without needing to create all possible combinations of options, which would double every time a new option is added.

This is the cmd API I'm tentatively using for the pipelines: --pipeline "chunks=5 device_map=0:0-5,1:5-10 ...."

One thing to think about further is that this integration breaks the usual convention that self.model is the original model (FullyShardedDataParallel consumes the model to use less memory).

Yes, we will need to rethink this - the trainer is getting more and more complex.
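
A minimal sketch (not the merged implementation) of how a whitespace-separated --sharded_ddp value like the one proposed above could be parsed; the ShardedDDPOption enum and its member names are illustrative assumptions:

from enum import Enum

class ShardedDDPOption(Enum):
    # Illustrative option names only.
    SIMPLE = "simple"
    ZERO_DP_2 = "zero_dp_2"
    ZERO_DP_3 = "zero_dp_3"
    OFFLOAD = "offload"

def parse_sharded_ddp(value: str) -> set:
    """Turn e.g. "zero_dp_3 offload" into {ZERO_DP_3, OFFLOAD}."""
    options = set()
    for token in value.split():
        try:
            options.add(ShardedDDPOption(token))
        except ValueError:
            valid = [o.value for o in ShardedDDPOption]
            raise ValueError(f"Unknown --sharded_ddp option {token!r}; expected one of {valid}")
    return options

# New options can be added to the enum later without multiplying flag combinations.
print(parse_sharded_ddp("zero_dp_3 offload"))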

@sgugger
Collaborator Author

sgugger commented Feb 23, 2021

Do you feel it's better to hardcode these combinations and not have a more flexible approach of:

--sharded_ddp "zero2;offload;future_option"

Happy to explore that design as it seems more flexible and less prone to future breaking changes. Will adapt the PR accordingly once we get the wrapper to work.

@stas00
Contributor

stas00 commented Feb 23, 2021

Probably whitespace separation is more readable: --sharded_ddp "zero2 offload future_option"

Also we need to make sure that we distinguish between FullyShardedDataParallel and ShardedDataParallel since, as was noted in the comments, they aren't quite the same. Perhaps not_full for ShardedDataParallel? Both should correspond to stage 2, but they don't work in the same way.

DeepSpeed has a stage param which goes from 0 to 3, where stage=0 doesn't enable ZeRO and each other number matches the corresponding stage.

For the user's sake perhaps we could make things as similar as possible so it'd be more intuitive for them to switch between fairscale (and eventually pytorch) and deepspeed.

Also note that DeepSpeed exposes other params, like the size of buckets, which actually are very important and need to be user-configurable. I won't be surprised if FSDP also makes those configurable down the road - i.e. more params.

@sgugger
Collaborator Author

sgugger commented Feb 25, 2021

Reworked the API to take your suggestion of a list of options into account, @stas00. I don't think we have to worry about uniformity with deepspeed or further cleanup at this stage, as:

  • this API will evolve in the future (ShardedDataParallel might very well disappear if FullyShardedDataParallel is better, and this might change again on the road to being merged into PyTorch)
  • we don't yet know all the options we will have between deepspeed/fairscale/PyTorch
  • this is an experimental API and while we won't break it just for fun, we can make slight changes down the road.

@sgugger sgugger changed the title [WIP] Add support for ZeRO-2/3 and ZeRO-offload in fairscale Add support for ZeRO-2/3 and ZeRO-offload in fairscale Feb 25, 2021
Contributor
@stas00 stas00 left a comment

Awesome work, @sgugger !!!

src/transformers/trainer.py (outdated review thread, resolved)
-    sharded_ddp: ShardedDDPType = field(
-        default="no",
+    sharded_ddp: str = field(
+        default="",
Contributor

Perhaps list the choices here? And perhaps add a very small example of combining 2 of them in the value, since it's not a usual pattern - a user might struggle here.
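
A hedged sketch of what that suggestion might look like in practice (the class name, help wording, and listed options are assumptions, not the merged code):

from dataclasses import dataclass, field

@dataclass
class SketchTrainingArguments:
    # Illustrative only: list the valid options and show how two of them combine.
    sharded_ddp: str = field(
        default="",
        metadata={
            "help": (
                "Whether or not to use sharded DDP training (in distributed training only). "
                "Pass a whitespace-separated list of options among: simple, zero_dp_2, "
                'zero_dp_3, offload. For example: --sharded_ddp "zero_dp_3 offload".'
            )
        },
    )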

Member

I agree with @stas00!

src/transformers/training_args.py (outdated review thread, resolved)
Member
@LysandreJik LysandreJik left a comment

Fantastic! Left only nitpicks.

Comment on lines +285 to +287
3. To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp zero_dp_3``
to the command line arguments, and make sure you have added the distributed launcher ``-m torch.distributed.launch
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
Member

Love the API

docs/source/main_classes/trainer.rst (review thread, resolved)
-    sharded_ddp: ShardedDDPType = field(
-        default="no",
+    sharded_ddp: str = field(
+        default="",
Member

I agree with @stas00!

sgugger and others added 2 commits February 25, 2021 09:32
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@sgugger sgugger merged commit 9d14be5 into master Feb 25, 2021
@sgugger sgugger deleted the zero_3_offload branch February 25, 2021 16:07
@stas00
Contributor

stas00 commented Feb 26, 2021

Moving the cl arg naming discussion from #10354 (review) out into the open.

So if it's not DDP but DP, then we should probably change the cl arg to _dp as I suggested above, so that it's consistently either DP or DDP all the way through.

Or perhaps we should just call it --sharded? The dp part is already inside the value anyway, as in: --sharded zero_dp_3.
