
Performance improvements, new layers, ship models to other frameworks (via ONNX), CUDA9, CuDNNv7, lots of bug fixes

Released by @soumith on 05 Dec, 01:57

Table of contents

  • Breaking changes: removed reinforce()
  • New features
    • Unreduced losses
    • A profiler for the autograd engine
  • More functions support higher-order gradients
    • New features in Optimizers
    • New layers and nn functionality
    • New Tensor functions and Features
    • Other additions
  • API changes
  • Performance improvements
    • Big reduction in framework overhead (helps small models)
    • 4x to 256x faster Softmax/LogSoftmax
    • More...
  • Framework Interoperability
    • DLPack Interoperability
    • Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
  • Bug Fixes (a lot of them)

Breaking changes

Stochastic functions, i.e. Variable.reinforce() were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.

We introduce the torch.distributions package to replace Stochastic functions.

Your previous code typically looked like this:

probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()

This is the new equivalent code:

probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

New features

Unreduced losses

Some loss functions can now compute per-sample losses in a mini-batch

  • By default, PyTorch reduces losses over the mini-batch (by summing or averaging) and returns a single scalar loss. This was limiting for some users.
  • Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the mini-batch
  • Example: loss = nn.CrossEntropyLoss(..., reduce=False) (see the sketch after this list)
  • Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
  • More loss functions will be covered in the next release
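
A minimal sketch of the reduce=False behavior described above, using toy data (the shapes and values are illustrative only):

import torch
import torch.nn as nn
from torch.autograd import Variable

# Toy batch: 4 samples, 3 classes
logits = Variable(torch.randn(4, 3))
targets = Variable(torch.LongTensor([0, 2, 1, 0]))

# Default: a single scalar loss reduced over the mini-batch
criterion = nn.CrossEntropyLoss()
print(criterion(logits, targets))             # 1-element Variable

# reduce=False: one loss value per sample in the mini-batch
criterion_per_sample = nn.CrossEntropyLoss(reduce=False)
print(criterion_per_sample(logits, targets))  # Variable of shape (4,)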

A built-in profiler in the autograd engine

We built a low-level profiler to help you identify bottlenecks in your models

Let us start with an example:

>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
>>> print(prof)
--------------------------------  ----------  ---------
Name                               CPU time   CUDA time
--------------------------------  ----------  ---------
PowConstant                        142.036us    0.000us
N5torch8autograd9GraphRootE         63.524us    0.000us
PowConstantBackward                184.228us    0.000us
MulConstant                         50.288us    0.000us
PowConstant                         28.439us    0.000us
Mul                                 20.154us    0.000us
N5torch8autograd14AccumulateGradE   13.790us    0.000us
N5torch8autograd5CloneE              4.088us    0.000us

The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your Python program under nvprof. For example:

nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>

# in python
>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

Then, you can load trace_name.prof in PyTorch and print a summary profile report.

>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)

Read additional documentation here

Higher order gradients

Added higher-order gradient support for the following layers (a short example follows the list):

  • ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
  • PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
  • MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
  • DataParallel
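
As an illustration (not taken from the release notes), a gradient-of-gradient can be computed through one of the newly supported ops, here F.elu, by keeping the graph of the first backward pass:

import torch
import torch.nn.functional as F
from torch.autograd import Variable, grad

x = Variable(torch.randn(5), requires_grad=True)
y = F.elu(x).sum()

# First-order gradient, retaining the graph so it can be differentiated again
g, = grad(y, x, create_graph=True)

# Second-order gradient: the gradient of sum(dy/dx) with respect to x
gg, = grad(g.sum(), x)
print(gg)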

Optimizers

  • optim.SparseAdam: Implements a lazy version of Adam algorithm suitable for sparse tensors.
    • In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
  • Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer (see the sketch below).
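
A small sketch of both items, with hypothetical layer sizes:

import torch.nn as nn
import torch.optim as optim

# SparseAdam pairs naturally with layers that emit sparse gradients,
# e.g. nn.Embedding(..., sparse=True); only the moments touched by the
# gradient are updated at each step.
emb = nn.Embedding(10000, 32, sparse=True)
sparse_opt = optim.SparseAdam(emb.parameters(), lr=1e-3)

# add_param_group: extend an already constructed optimizer with a new
# parameter group, optionally with its own hyper-parameters.
body = nn.Linear(32, 32)
opt = optim.SGD(body.parameters(), lr=0.1, momentum=0.9)

head = nn.Linear(32, 10)   # layer added after the optimizer was created
opt.add_param_group({'params': list(head.parameters()), 'lr': 0.01})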

New layers and nn functionality

  • Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
  • Added LPPool1d
  • F.pad now has support for:
    • 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
    • constant padding on n-d signals
  • nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
  • grid_sample now allows padding with the border value via padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
  • Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters (see the sketch after this list)
    • parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters
    • vector_to_parameters takes a vector of flattened parameters and copies the values over to a network's parameters
    • Convenient for some reinforcement learning algorithms, such as cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
  • Allow user to not specify certain input dimensions for AdaptivePool*d and infer them at runtime.
    • For example:
    # target output size of 10x7
    m = nn.AdaptiveMaxPool2d((None, 7))
  • DataParallel container on CPU is now a no-op (instead of erroring out)
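
A brief sketch of a few of the items above (toy shapes; not exhaustive):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Reflection padding via F.pad on a 4D (B x C x H x W) input
x = Variable(torch.randn(1, 3, 8, 8))
padded = F.pad(x, (2, 2, 2, 2), mode='reflect')

# nn.Upsample on a 1D (B x C x L) signal
up = nn.Upsample(scale_factor=2, mode='nearest')
y = up(Variable(torch.randn(1, 4, 16)))

# Flatten all parameters into one vector, modify, and write back
net = nn.Linear(4, 2)
flat = parameters_to_vector(net.parameters())       # 1D Variable
vector_to_parameters(flat * 0.5, net.parameters())  # copy values back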

New Tensor functions and features

  • Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
  • adds broadcasting support to bitwise operators
  • Added Tensor.put_ and torch.take, similar to numpy.take and numpy.put (see the sketch after this list).
    • The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
    • The put function copies value into a tensor also using linear indices.
    • Differences from numpy equivalents:
      • numpy.take has an optional axis argument, which behaves like index_select. This axis argument is not yet present.
      • numpy.put repeats the values if necessary to make them as long as indices. This behavior is not yet replicated.
  • add zeros and zeros_like for sparse Tensors.
  • 1-element Tensors can now be cast to Python scalars. For example: int(torch.Tensor([5])) works now.
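
A short sketch of take, put_, and the new error-function ops (illustrative values):

import torch

x = torch.arange(0, 12).view(3, 4)
idx = torch.LongTensor([0, 5, 11])

# torch.take: linear indexing without flattening first;
# the output has the same shape as the indices
print(torch.take(x, idx))            # values at linear positions 0, 5, 11

# Tensor.put_: write values at linear indices, in place
x.put_(idx, torch.Tensor([-1, -2, -3]))

# torch.erf / torch.erfinv are approximate inverses of each other
z = torch.randn(5)
print(torch.erfinv(torch.erf(z)))    # ~ z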

Other additions

  • Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:
    >>> torch.cuda.get_device_name(0)
    'Quadro GP100'
    >>> torch.cuda.get_device_capability(0)
    (6, 0)
  • If one sets torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms
  • torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once
  • torch.cuda.empty_cache() frees the cached memory blocks in PyTorch's caching allocator. This is useful for long-running IPython notebooks that share the GPU with other processes (see the sketch below).
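
A combined sketch of the CUDA utilities above (requires a CUDA-enabled build; guarded so it is a no-op otherwise):

import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

    # Ask cuDNN to pick deterministic convolution algorithms
    torch.backends.cudnn.deterministic = True

    # Save and later restore the RNG state of every GPU at once
    states = torch.cuda.get_rng_state_all()
    torch.cuda.set_rng_state_all(states)

    # Release cached blocks held by the caching allocator
    torch.cuda.empty_cache()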

API changes

  • softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension)
  • torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
  • Remove all instances of device_id and replace it with device, to make things consistent
  • torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you pass allow_unused=True
    This is useful when using torch.autograd.grad on large graphs with lists of inputs / outputs
    For example:
    x, y = Variable(...), Variable(...)
    torch.autograd.grad(x * 2, [x, y]) # errors
    torch.autograd.grad(x * 2, [x, y], allow_unused=True) # works
  • pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
  • Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
  • torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
  • adds zero_() to Variable
  • Variable.shape returns the size of the Tensor (now made consistent with Tensor)
  • torch.version.cuda specifies the CUDA version that PyTorch was compiled with
  • Add a missing function random_ for CUDA.
  • torch.load and torch.save can now take a pathlib.Path object, the standard Python 3 filesystem-path type
  • load_state_dict used to be strict about matching parameter key names when loading one model's state_dict into another (for example, to fine-tune a pre-trained network). It now accepts a strict=False option that loads only the parameters whose keys match and ignores the rest (see the sketch after this list).
  • added nn.functional.embedding_bag that is equivalent to nn.EmbeddingBag
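
Two of the changes above in a minimal sketch (the partial state_dict here is a stand-in for a real checkpoint):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# softmax / log_softmax now take an explicit dim; negative dims count
# from the end, so dim=-1 is the last dimension
x = Variable(torch.randn(2, 5))
p = F.softmax(x, dim=-1)

# strict=False: load only the parameters whose keys match, ignore the rest
model = nn.Linear(5, 5)
partial_state = {'weight': torch.randn(5, 5)}   # no 'bias' entry
model.load_state_dict(partial_state, strict=False)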

Performance Improvements

  • The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
  • softmax and log_softmax are now 4x to 256x faster on the GPU after rewriting the GPU kernels
  • 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
  • nn.Embedding's renorm option is much faster on the GPU. For embedding dimensions of 100k x 128 and a batch size of 1024, it is 33x faster.
  • All pointwise ops now use OpenMP and get multi-core CPU benefits
  • Added dedicated CUDA kernels for group convolutions where groups == nInputPlane (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the benchmark table for more details.
  • Fixed optim.SGD's memory usage for sparse gradients (for ex. nn.Embedding(..., sparse=True)), reducing the usage on a user-provided test script by 10x.
  • Optional NNPack integration for faster CPU convolutions (not part of binaries)
  • Reduce overhead of broadcasting if Tensors aren't broadcastable
  • torch.nn.utils.weight_norm over the right-most dimensions is faster
  • Backward of torch.norm is sped up by ~1.5x
  • Improve the performance of pack_padded_sequence
  • Add a single-argument version of torch.arange. For example torch.arange(10)

Framework Interoperability

DLPack Interoperability

DLPack Tensors are cross-framework Tensor formats. We now have torch.utils.to_dlpack(x) and torch.utils.from_dlpack(x) to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.
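
A round-trip sketch (depending on the build, the helpers may be exposed through the torch.utils.dlpack submodule rather than directly on torch.utils):

import torch
from torch.utils import dlpack

x = torch.randn(3, 3)

# Tensor -> DLPack capsule -> Tensor, with no memory copy
capsule = dlpack.to_dlpack(x)
y = dlpack.from_dlpack(capsule)

# x and y share the same storage
y[0, 0] = 42.0
assert x[0, 0] == 42.0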

Model exporter to ONNX

ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, Tensorflow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.

  • There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models (see the sketch after the list of supported operators).

  • The operations supported in this release are:

    • add, sub (nonzero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
    • expand (only when used before a broadcasting ONNX operator; e.g., add)
    • prelu (single weight shared among input channels not supported)
    • threshold (non-zero threshold/non-zero value not supported)
    • Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
    • elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
    • unfold (experimental support with ATen-Caffe2 integration)
    • Embedding (no optional arguments supported)
    • RNN
    • FeatureDropout (training mode not supported)
    • Index (constant integer and tuple indices supported)
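
A minimal export sketch using a stand-in ConvNet built only from operators in the list above; the graph is recorded by tracing one forward pass on a dummy input:

import torch
import torch.nn as nn
from torch.autograd import Variable

# A tiny ConvNet-like model (a placeholder for a real network)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ELU(),
    nn.MaxPool2d(2),
)
dummy_input = Variable(torch.randn(1, 3, 32, 32))

# Trace the model once and write the ONNX graph to a file
torch.onnx.export(model, dummy_input, 'tiny_convnet.onnx', verbose=True)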

Usability Improvements

  • More cogent error messages during indexing of Tensors / Variables
  • Add proper error message for specifying dimension on a tensor with no dimensions
  • better error messages for Conv*d input shape checking
  • More user-friendly error messages for LongTensor indexing
  • Better error messages and argument checking for Conv*d routines
  • Trying to construct a Tensor from a Variable fails more appropriately
  • If you are using a PyTorch binary with insufficient CUDA version, then a warning is printed to the user.
  • Fixed incoherent error messages in load_state_dict
  • Fix error message for type mismatches with sparse tensors

Bug fixes

torch

  • Fix CUDA lazy initialization to not trigger on calls to torch.manual_seed (instead, the calls are queued and run when CUDA is initialized)

Tensor

  • if x is 2D, x[[0, 3],] was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do x[[0, 3]]
  • x.sort(descending=True) used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
  • Tensor constructors with numpy input: torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))
    • torch will now copy the contents of the array in a storage of appropriate type.
    • If types match, it will share the underlying array (no-copy), with equivalent semantics to initializing a tensor with another tensor.
    • On CUDA, torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32)) will now work by making a copy.
  • ones_like and zeros_like now create Tensors on the same device as the original Tensor
  • torch.multinomial on the CPU would reshape the input prob_dist in-place. Fixed this to make sure the prob_dist input's shape is unchanged after the call to multinomial
  • expand and expand_as allow expanding an empty Tensor to another empty Tensor
  • when [..., None, ...] was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
  • Fix exponential distribution implementation to never sample infinity - cuRAND returns numbers in (0, 1]
  • torch.HalfTensor supports numpy() and torch.from_numpy
  • Add additional size checking for torch.scatter
  • fix torch.tril and torch.triu on the GPU for storage-offset Tensors (would return incorrect result).
  • Fix a memory leak in CUDA qr decomposition
  • Fix stream-awareness issues in THCUNN kernels
  • Fix kwargs parsing in torch.topk
  • Fixed random_ on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
  • Fix ZeroDivisionError: float division by zero when printing certain Tensors
  • torch.gels when m > n had a truncation bug on the CPU and returned incorrect results. Fixed.
  • Added a check in tensor.numpy() to ensure that no positional arguments are passed
  • Added a check that a Tensor is contiguous before it is moved to CUDA pinned memory
  • any and all now work on empty Tensors on the CPU (previously errored out)
  • Fix symeig on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
  • Improved the numerical stability of torch.var and torch.std by using Welford's algorithm
  • The random number generator returned uniform samples with inconsistent bounds (an inconsistency in the CPU implementation, plus a cuBLAS bug).
    • Now, all uniformly sampled numbers fall within the bounds [0, 1), across all types and devices
  • Fix torch.svd to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
  • Allows empty index Tensor for index_select (instead of erroring out)
  • Previously, when eigenvectors=False, symeig returned arbitrary values for the eigenvectors. These are now zeroed out.

sparse

  • Fix bug with 'coalesced' calculation in sparse 'cadd'
  • Fixes .type() not converting indices tensor.
  • Fixes sparse tensor coalesce on the GPU in corner cases

autograd

  • Fixed crashes when calling backward on a leaf Variable with requires_grad=False
  • Fixed a bug in Variable.type() with non-default GPU inputs
  • when torch.norm returned 0.0, the gradient was NaN. We now use the subgradient at 0.0, so the gradient is 0.0.
  • Fix a correctness issue with advanced indexing and higher-order gradients
  • torch.prod's backward was failing on the GPU due to a type error, fixed.
  • Advanced Indexing on Variables now allows the index to be a LongTensor backed Variable
  • Variable.cuda() and Tensor.cuda() are consistent in kwargs options

optim

  • torch.optim.lr_scheduler is now imported by default.

nn

  • Returning a dictionary from a nn.Module's forward function is now supported (used to throw an error)
  • When register_buffer("foo", ...) is called, and self.foo already exists, then instead of silently failing, now raises a KeyError
  • Fixed loading of older checkpoints of RNN/LSTM which were missing _data_ptrs attributes.
  • nn.Embedding had a hard error when using the max_norm option. This is fixed now.
  • when using the max_norm option, the passed-in indices were modified in place by the underlying implementation. This is fixed by passing a clone of the indices to the renorm kernel.
  • F.affine_grid now can take non-contiguous inputs
  • EmbeddingBag can accept both 1D and 2D inputs now.
  • Workaround a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
  • fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
  • if BatchNorm has only 1 value per channel in total, raise an error in training mode.
  • Make cuDNN bindings respect the current cuda stream (previously raised incoherent error)
  • fix grid_sample backward when gradOutput is a zero-strided Tensor
  • Fix a segmentation fault when reflection padding is out of Tensor bounds.
  • If LogSoftmax has only 1 element, -inf was returned. Now this correctly returns 0.0
  • Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
  • Detect pointer aliasing in cuDNN RNN flatten_parameters and avoid that path.
  • Fixed ELU higher order gradients when applied in-place
  • Workaround a CuDNN RNN bug for half-precision
  • Prevent numerical issues with poisson_nll_loss when log_input=False by adding a small epsilon

distributed and multi-gpu

  • Allow kwargs-only inputs to DataParallel. This used to fail: n = nn.DataParallel(Net()); out = n(input=i)
  • DistributedDataParallel calculates num_samples correctly in python2
  • Fix the case of DistributedDataParallel when 1-GPU per process is used.
  • Fixed DataParallel to specify GPUs that don't include GPU-0
  • DistributedDataParallel no longer errors out on exit; the daemon flag is now set.
  • Fix a bug in DistributedDataParallel in the case when model has no buffers (previously raised incoherent error)
  • Fix __getstate__ to be functional in DistributedDataParallel (was returning nothing)
  • Fix a deadlock in the NCCL bindings when GIL and CudaFreeMutex were starving each other

Others

  • model_zoo.load_url now first attempts to use the requests library if available, and then falls back to urllib
  • Fix error when default_collate is passed a collection of numpy.str_