
PixelTransformer: Sample Conditioned Signal Generation

Shubham Tulsiani¹  Abhinav Gupta¹ ²


https://shubhtuls.github.io/PixelTransformer/

arXiv:2103.15813v1 [cs.CV] 29 Mar 2021

Abstract

We propose a generative model that can infer a distribution for the underlying spatial signal conditioned on sparse samples e.g. plausible images given a few observed pixels. In contrast to sequential autoregressive generative models, our model allows conditioning on arbitrary samples and can answer distributional queries for any location. We empirically validate our approach across three image datasets and show that we learn to generate diverse and meaningful samples, with the distribution variance reducing given more observed pixels. We also show that our approach is applicable beyond images and can allow generating other types of spatial outputs e.g. polynomials, 3D shapes, and videos.

1. Introduction

Imagine an artist with an empty canvas. She starts with a dab of sky blue paint at the top, and a splash of fresh green at the bottom. What is the painting going to depict? Perhaps an idyllic meadow, or trees in a garden under a clear sky? But probably not a living room. It is quite remarkable that given only such sparse information about arbitrary locations, we can make guesses about the image in the artist's mind.

The field of generative modeling of images, with the goal of learning the distribution of possible images, focuses on developing similar capabilities in machines. Most recent approaches can be classified as belonging to one of two modeling frameworks. First, and more commonly used, is the latent variable modeling framework (Kingma & Welling, 2013; Goodfellow et al., 2014). Here, the goal is to represent the possible images using a distribution over a bottleneck latent variable, samples from which can be decoded to obtain images. However, computing the exact probabilities for images is often intractable and it is not straightforward to condition inference on sparse observations e.g. pixel values. As an alternative, a second class of autoregressive approaches directly model the joint distribution over pixels. This can be easily cast as a product of conditional distributions (van den Oord et al., 2016b;c), which makes it tractable to compute. Conditional distributions are estimated by learning to predict new pixels from previously sampled/generated pixels. However, these approaches use fixed sequencing (mostly predicting pixels from top-left to bottom-right) and therefore the learned model can only take a fixed ordering between query and sampled pixels. This implies that these models cannot predict whole images from a few random splashes – similar to what we humans can do given a description of the artist's painting above.

In this work, our goal is to build computational generative models that can achieve this – given information about some random pixels and their associated color values, we aim to predict a distribution over images consistent with the evidence. To this end, we show that it suffices to learn a function that estimates the distribution of possible values at any query location conditioned on an arbitrary set of observed samples. We present an approach to learn this function in a self-supervised manner, and show that it can allow answering queries that previous sequential autoregressive models cannot e.g. mean image given observed pixels, or computing image distribution given random observations. We also show that our proposed framework is generally applicable beyond images and can be learned to generate generic dense spatial signals given corresponding samples.

(¹ Facebook AI Research, ² Carnegie Mellon University. Correspondence to: Shubham Tulsiani <[email protected]>.)
2. Formulation

Given the values of some (arbitrary) pixels, we aim to infer what images are likely conditioned on this observation. More formally, for any pixel denoted by random variable x, let v_x denote the value for that pixel and let S_0 ≡ {v_{x_k}}_{k=1..K} correspond to a set of such sampled values. We are then interested in modeling p(I|S_0), i.e. the conditional distribution over images I given a set of sample pixel values S_0.

From Image to Pixel Value Distribution. We first note that an image is simply a collection of values of pixels in a discrete grid. Assuming an image has N pixels with locations denoted as {g_n}_{n=1..N}, our goal is therefore to model p(I|S_0) ≡ p(v_{g_1}, v_{g_2}, ..., v_{g_N} | S_0). Instead of modeling this joint distribution directly, we observe that it can be further factorized as a product of conditional distributions using the chain rule:

    p(v_{g_1}, v_{g_2}, ..., v_{g_N} | S_0) = ∏_n p(v_{g_n} | S_0, v_{g_1}, ..., v_{g_{n-1}})

Denoting by S_n ≡ S_0 ∪ {v_{g_j}}_{j=1..n}, we obtain:

    p(I|S_0) = ∏_n p(v_{g_n} | S_{n-1})        (1)

Sample Conditioned Value Prediction. The key observation from Eq. 1 is that all the factors are of the form p(v_x|S). That is, the only queries we need to answer are: 'given some observed samples S, what is the distribution of possible values at location x?'. To learn a sample conditioned generative model for images, we therefore propose to learn a function f_θ to infer p(v_x|S) for arbitrary inputs x and S. Concretely, we formulate our task as that of learning a function f_θ(x, {(x_k, v_k)}) that can predict the value distribution at an arbitrary query location x given a set of arbitrary sample (position, value) pairs {(x_k, v_k)}.

In summary:

• The task of inferring p(I|S_0) can be reduced to queries of the form p(v_x|S).

• We propose to learn a function f_θ(x, {(x_k, v_k)}) that can predict p(v_x|{v_{x_k}}) for arbitrary inputs.

While we used images as a motivating example, our formulation is also applicable for modeling distributions of other dense spatially varying signals. For RGB images, x ∈ R^2, v ∈ R^3, but other spatial signals e.g. polynomials (x ∈ R^1, v ∈ R^1), 3D shapes represented as Signed Distance Fields (x ∈ R^3, v ∈ R^1), or videos (x ∈ R^3, v ∈ R^3) can also be handled by learning f_θ(x, {(x_k, v_k)}) of the corresponding form (see Section 6).
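To make the reduction of Eq. 1 concrete, the sketch below accumulates the joint conditional log-likelihood log p(I|S_0) from any per-location conditional model. The callable f_theta and its log_prob interface are placeholders for the learned predictor described in Section 4; this is a minimal illustration of the factorization, not the released implementation.

```python
def joint_log_likelihood(f_theta, observed, pixels):
    """Evaluate log p(I | S_0) via the chain-rule factorization of Eq. 1.

    f_theta(x, samples) -> distribution object with a .log_prob(v) method
        (placeholder for the learned conditional model p(v_x | S)).
    observed: list of (position, value) pairs forming the conditioning set S_0.
    pixels:   list of (position, value) pairs covering the full image grid,
              in any order (the product in Eq. 1 runs over all N pixels).
    """
    samples = list(observed)          # S_0
    total = 0.0
    for x, v in pixels:               # n = 1 ... N
        dist = f_theta(x, samples)    # p(v_{g_n} | S_{n-1})
        total += dist.log_prob(v)
        samples.append((x, v))        # S_n = S_{n-1} ∪ {v_{g_n}}
    return total
```

Because f_θ accepts arbitrary conditioning sets, the loop above works for any ordering of the pixels, which is what later enables the randomized sampling order of Section 4.3.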
3. Related Work

Autoregressive Generative Models. Closely related to our work, autoregressive generative modeling approaches also factorize the joint distribution into per-location conditional distributions. Seminal works such as Wavenet (van den Oord et al., 2016a), PixelRNN (van den Oord et al., 2016c) and PixelCNN (van den Oord et al., 2016b) showed that we can learn the distribution over the values of the 'next' timestep/pixel given the values of the previous ones, and thereby learn a generative model for the corresponding domain (speech/images). Subsequent approaches have further improved over these works by modifying the parametrization (Salimans et al., 2017), incorporating hierarchy (van den Oord et al., 2017; Razavi et al., 2019), or (similar to ours) foregoing convolutions in favor of alternate base architectures (Chen et al., 2020; Parmar et al., 2018) such as Transformers (Vaswani et al., 2017).

While this line of work has led to impressive results, the core distribution modeled is that of the 'next' value given 'previous' values. More formally, while we aim to predict p(v_x|S) for arbitrary x, S, the prior autoregressive generative models only infer this for cases where S contains pixels in some sequential (e.g. raster) order and x is the immediate 'next' position. Although using masked convolutions can allow handling many possible inference orders (Jain et al., 2020), the limited receptive field of convolutions still limits such orders to locally continuous sequences. Our work can therefore be viewed as a generalization of previous 'sequential' autoregressive models in two ways: a) allowing any query position x, and b) handling arbitrary samples S for conditioning. This allows us to answer questions that prior autoregressive models cannot e.g. 'if the top-left pixel is blue, how likely is the bottom-right one to be green?', 'what is the mean image given some observations?', or 'given values of 10 specific pixels, sample likely images'.

Implicit Neural Representations. There has been a growing interest in learning neural networks to represent 3D textured scenes (Sitzmann et al., 2019), radiance fields (Mildenhall et al., 2020; Martin-Brualla et al., 2021; Zhang et al., 2020) or more generic spatial signals (Sitzmann et al., 2020; Tancik et al., 2020). The overall approach across these methods is to represent the underlying signal by learning a function g_φ that maps query positions x to corresponding values v (e.g. pixel location to intensity). Our learned f_θ(·, {(x_k, v_k)}) can similarly be thought of as mapping query positions to a corresponding value (distribution), while being conditioned on some sample values. A key difference, however, is the ability to generalize – the above mentioned approaches learn an independent network per instance e.g. a separate g_φ is used to model each scene, therefore requiring from thousands to millions of samples to fit g_φ for a specific scene. In contrast, our approach uses a common f_θ across all instances and can therefore generalize to unseen ones given only a sparse set of samples. Although some recent approaches (Xu et al., 2019; Park et al., 2019; Mescheder et al., 2019) have shown a similar ability to generalize and infer novel 3D shapes/scenes given input image(s), these cannot handle sparse input samples and do not allow inferring a distribution over the output space.
Latent Variable based Generative Models. Our approach, similar to sequential autoregressive models, factorizes the image distribution as products of per-pixel distributions. An alternate approach to generative modeling, however, is to transform a prior distribution over latent variables to the output distribution via a learned decoder. Several approaches allow learning such a decoder by leveraging diverse objectives e.g. adversarial loss (Goodfellow et al., 2014), variational bound on the log-likelihood (Kingma & Welling, 2013), nearest neighbor matching (Bojanowski et al., 2018; Li & Malik, 2018), or the log-likelihood with a restricted decoder (Rezende & Mohamed, 2015). While all of these methods allow efficiently generating new samples from scratch (by randomly sampling in the latent space), it is not straightforward to condition this sampling given partial observations – which is the goal of our work.

Bayesian Optimization and Gaussian Processes. As alluded to earlier, any spatial signal can be considered a function from positions to values. Our goal is then to infer a distribution over possible functions given a set of samples. This is in fact also a central problem tackled in bayesian optimization (Brochu et al., 2010), using techniques such as gaussian processes (Rasmussen, 2003) to model the distribution over functions. While the goal of these approaches is similar to ours, the technique differs significantly. These classical methods assume a known prior over the space of functions and leverage it to obtain the posterior given some samples (we refer the reader to (Murphy, 2012) for an excellent overview). Such a prior over functions (that also supports tractable inference), however, is not easily available for complex signals such as images or 3D shapes – although some weak priors (Ulyanov et al., 2018; Osher et al., 2017) do allow impressive image restoration, they do not enable generation given sparse samples. In contrast, our approach allows learning from data, and can be thought of as learning this prior as well as performing efficient inference via the learned model f_θ.

4. Learning and Inference

Towards inferring the distribution of images given a set of observed samples, we presented a formulation in Section 2 that reduced this task to that of learning a function to model p(v_x|{v_{x_k}}). We first describe in Section 4.1 how we parametrize this function and how one can learn it from raw data. We then show in Section 4.2 and Section 4.3 how this learned function can be used to query and draw samples from the conditional distribution over images p(I|S_0). While we use images as the running example, we reiterate that the approach is more generally applicable (as we also empirically show in Section 6).

4.1. Learning to Predict Value Distributions

We want to learn a function f_θ that can predict the probability distribution of possible values at any query location x conditioned on an (arbitrary) set of positions with known values. More formally, we want f_θ(x, {(x_k, v_k)}) to approximate p(v_x|{v_{x_k}}).

Distribution Parametrization. The output of f_θ is supposed to be a distribution over possible values at location x and not a single value estimate. How should we parametrize this distribution? Popular choices like a gaussian parametrization may not capture the multimodal nature of the distribution e.g. a pixel may be black or white, but not gray. An alternative is to discretize the output space, but this may require a large number of bins e.g. 256^3 for possible RGB values. Following PixelCNN++ (Salimans et al., 2017), we opt for a hybrid approach – we predict probabilities for the value belonging to one of B discrete bins, while also predicting a continuous gaussian parametrization within each bin. This allows predicting multimodal distributions while enabling continuous outputs.

Concretely, we instantiate B bins (roughly) uniformly spaced across the output space where for any bin b, its center corresponds to c^b. The output distribution is then parametrized as ω ≡ {(q^b, μ^b, σ^b)}_{b=1..B}. Here q^b ∈ R^1 is the probability of assignment to bin b, and c^b + μ^b is the mean of the corresponding gaussian distribution with uniform variance σ^b ∈ R^1. Assuming the values v ∈ R^d, our network therefore outputs ω ∈ R^(B×(d+2)). We note that this distribution is akin to a mixture-of-gaussians, and given a value v, we can efficiently compute its likelihood p(v; ω) under it (see appendix for details). We can also efficiently compute the expected value v̄ as:

    v̄ ≡ ∫ p(v; ω) v dv = Σ_{b=1..B} q^b (μ^b + c^b)        (2)
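The sketch below makes this parametrization concrete: given bin centers c^b and a predicted ω = (q^b, μ^b, σ^b), it computes the appendix log-likelihood, the expected value of Eq. 2, and a sample. It is a minimal NumPy illustration under stated assumptions (d = 1 for brevity, and the α-weighted log-likelihood of the appendix); the array names and the toy ω are placeholders rather than the paper's code.

```python
import numpy as np

def expected_value(q, mu, centers):
    """Eq. 2: v_bar = sum_b q^b * (mu^b + c^b)."""
    return np.sum(q * (mu + centers))

def log_likelihood(v, q, mu, sigma, centers, alpha=0.1):
    """Appendix: pick the closest bin b*, then combine the bin log-probability
    with an alpha-weighted gaussian term around that bin's mean."""
    b_star = np.argmin(np.abs(v - centers))
    resid = (v - centers[b_star] - mu[b_star]) / sigma[b_star]
    return np.log(q[b_star]) - alpha * (np.log(sigma[b_star]) + resid ** 2)

def sample_value(q, mu, sigma, centers, rng):
    """Draw v ~ p(v; omega): sample a bin, then sample within it."""
    b = rng.choice(len(q), p=q)
    return centers[b] + mu[b] + sigma[b] * rng.standard_normal()

# Toy usage with B = 4 bins over [0, 1] and a (hypothetical) prediction omega.
rng = np.random.default_rng(0)
centers = np.linspace(0.125, 0.875, 4)     # bin centers c^b
q = np.array([0.7, 0.1, 0.1, 0.1])         # bin probabilities q^b
mu = np.zeros(4)                           # per-bin mean offsets mu^b
sigma = np.full(4, 0.05)                   # per-bin std sigma^b
print(expected_value(q, mu, centers))      # -> 0.275 for this toy omega
print(log_likelihood(0.12, q, mu, sigma, centers))
print(sample_value(q, mu, sigma, centers, rng))
```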
Figure 1. Prediction Model. Given a set of (position, value) pairs {(x_k, v_k)}, our model encodes them using a Transformer (Vaswani et al., 2017) encoder. A query position x is then processed in context of this encoding and a value distribution is predicted (parametrized by ω).

Model Architecture. Given a query position x, we want f_θ(x, {(x_k, v_k)}) to output a value distribution as parametrized above. There are two design considerations that such a predictor should respect: a) allow a variable number of input samples {(x_k, v_k)}, and b) be permutation-invariant w.r.t. the samples. We leverage the Transformer (Vaswani et al., 2017) architecture as our backbone as it satisfies both these requirements. As depicted in Figure 1, our model can be considered as having two stages: a) an encoder that, independent of the query x, processes the input samples {(x_k, v_k)} and computes a per-sample embedding, and b) a decoder that predicts the output distribution by processing the query x in context of the encodings. As shown in Figure 1, we first independently embed each input sample (x_k, v_k) using position and value encoding modules respectively, while following the insight from (Tancik et al., 2020) to use fourier features when embedding positions. These per-sample encodings are then processed by a sequence of multi-headed self-attention modules (Vaswani et al., 2017) to yield the encoded representations for the input samples. The query position x is similarly embedded, and processed via multi-headed attention modules in context of the sample embeddings. A linear decoder finally predicts ω ∈ R^(B×(d+2)) to parametrize the output distribution.
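A minimal PyTorch sketch of such a two-stage predictor is given below: fourier position features, a self-attention encoder over the sample set, cross-attention from the query embedding to the encoded samples, and a linear head producing ω. The layer sizes, module names, and the single cross-attention block are illustrative assumptions and not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Map positions x in R^p to [sin(2*pi*xB), cos(2*pi*xB)] (Tancik et al., 2020)."""
    def __init__(self, pos_dim, n_freq=64, scale=10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(pos_dim, n_freq) * scale)

    def forward(self, x):                      # x: (..., pos_dim)
        proj = 2 * torch.pi * x @ self.B       # (..., n_freq)
        return torch.cat([proj.sin(), proj.cos()], dim=-1)

class PixelTransformerSketch(nn.Module):
    def __init__(self, pos_dim=2, val_dim=3, d_model=256, n_bins=256, n_layers=6):
        super().__init__()
        self.pos_embed = nn.Sequential(FourierFeatures(pos_dim), nn.Linear(128, d_model))
        self.val_embed = nn.Linear(val_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, n_bins * (val_dim + 2))  # omega: (q, mu, sigma) per bin

    def forward(self, query_pos, sample_pos, sample_val):
        # Encode the (position, value) samples with self-attention.
        tokens = self.pos_embed(sample_pos) + self.val_embed(sample_val)  # (B, K, d)
        memory = self.encoder(tokens)
        # Process query positions in context of the encoded samples.
        q = self.pos_embed(query_pos)                                     # (B, Q, d)
        ctx, _ = self.cross_attn(q, memory, memory)
        return self.head(ctx)                                             # (B, Q, n_bins*(val_dim+2))

# Toy forward pass: 2 images, 32 observed pixels, 16 query positions.
model = PixelTransformerSketch()
omega = model(torch.rand(2, 16, 2), torch.rand(2, 32, 2), torch.rand(2, 32, 3))
print(omega.shape)  # torch.Size([2, 16, 1280])
```

Note that the encoder runs once per conditioning set while the query path is cheap, so many queries can share the same encoded samples.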
Figure 2. Training Overview. Given an image, we randomly sample pixels to obtain the conditioning set S as well as a query pixel x with value v_x*. Our model predicts the conditional value distribution for this arbitrary query location and we use the negative log-likelihood for the true value as our learning objective.

Training Objective. Recall that our model f_θ(x, {(x_k, v_k)}) aims to approximate p(v_x|{v_{x_k}}) for arbitrary query positions x and sample sets S ≡ {v_{x_k}}. Given a collection of training images, we can in fact generate training data for this model in a self-supervised manner. As illustrated in Figure 2, we can simply sample arbitrary x, S from any image, and maximize the log-likelihood of the true value v_x* under the predicted distribution p(v_x|{v_{x_k}}).

While we described the processing for a single query position x, it is easy to parallelize inference and process a batch of queries Q conditioned on the same input sample set S. In this case, we can consider the model as independently predicting p(v_x|{v_{x_k}}) for each x ∈ Q. Instead of using a single query x, we therefore use a batch of queries Q and minimize the negative log-likelihood across them. More formally, given a dataset D of images, we randomly sample an image I, then choose arbitrary sample and query sets S, Q, and minimize the expected negative log-likelihood of the true values as our training objective:

    L = E_{I∼D} E_{S,Q∼I} E_{x∼Q} [−log p(v_x*; ω)],   where ω = f_θ(x, {(x_k, v_k)})        (3)

4.2. Inferring Marginals and Mean

Section 4.1 introduced our approach to enable learning f_θ that can approximate p(v|S). But given such a learned function, what can it enable us to do? One operation that we focus on later in Section 4.3 is that of sampling images I ∼ p(I|S). However, there is another question of interest which is not possible to answer with the previous sequential autoregressive models (van den Oord et al., 2016b;a), but is efficiently computable using our model: 'what is the expected image Ī given the samples S?'.

We reiterate that an image can be considered as a collection of values of pixels located in a discrete grid {g_n}_{n=1..N}. Instead of asking what the expected image Ī is, we can first consider a simpler question – what is the expected value v̄_{g_n} for the pixel g_n given S? By definition:

    v̄_{g_n} = ∫ p(v_{g_n}|S) v_{g_n} dv_{g_n}

As our learned model f_θ allows us to directly estimate the marginal distribution p(v_{g_n}|S), the above computation is extremely efficient to perform and can be done independently across all locations in the image grid {g_n}_{n=1..N}:

    v̄_{g_n} = ∫ p(v; ω_n) v dv;   ω_n = f_θ(g_n, {(x_k, v_k)})        (4)

Given the estimate of v̄_{g_n}, the mean image Ī is then just the image with each pixel assigned its mean value v̄_{g_n}, i.e. Ī ≡ {v̄_{g_n}}_{n=1..N}. The key difference compared to sequential autoregressive models (van den Oord et al., 2016b;a) that enables our model to compute this mean image is that our model allows computing p(v_{g_n}|S) for any location g_n, whereas approaches like (van den Oord et al., 2016b;a) can only do so for the 'next' pixel.

4.3. Autoregressive Conditional Sampling

One of the driving motivations for our work was to be able to sample the various likely images conditioned on a sparse set of pixels with known values. That is, we want to be able to draw samples from p(I|S_0). Equivalently, to sample an image from p(I|S_0), we need to sample the values at each pixel {v_{g_n}} from p(v_{g_1}, v_{g_2}, ..., v_{g_N} | S_0).

As we derived in Eq. 1, this distribution can be factored as a product of per-pixel conditional distributions. We can therefore sample from this distribution autoregressively – sampling one pixel at a time, with subsequent pixels being informed by the ones sampled prior. Concretely, we iteratively perform the following computation:

    ω_n = f_θ(g_n, {(x_k, v_k)} ∪ {(g_j, v′_j)}_{j=1..n−1})        (5)
    v′_n ∼ p(v; ω_n)        (6)

Here, ω_n denotes the parameters for the predicted distribution for the pixel g_n. Note that this prediction takes into account not just the initial samples S_0, but also the subsequent n − 1 samples (hence the difference from ω_n in Eq. 4). v′_n represents a value then sampled for the pixel g_n from the distribution parametrized by ω_n.
Randomized Sampling Order. While we sample the values one pixel at a time, the ordering of pixels g_1, ..., g_N need not correspond to anything specific e.g. it is not necessary that g_1 should be the top-left pixel and g_N be the bottom-right one. In fact, as our model f_θ is trained using arbitrary sets of samples S, using a structured sampling ordering e.g. raster order would make the testing setup differ from training. Instead, for every sample I ∼ p(I|S) that we draw, we use a new random order in which the pixels of the image grid are sampled.

Sidestepping Memory Bottlenecks. As Eq. 5 indicates, the input to f_θ when sampling the (n + 1)-th pixel is a set of size K + n – the initial K observations and the subsequent n samples. Unfortunately, our model's memory requirement, due to the self-attention modules, grows cubically with this input size. This makes it infeasible to autoregressively sample a very large number of pixels. However, we empirically observe that given a sufficient number of (random) samples, subsequent pixel value distributions do not exhibit a high variance. We leverage this observation to design a hybrid sampling strategy. When generating an image with N pixels, we sample the first N′ (typically 2048) autoregressively i.e. following Eq. 5 and Eq. 6. For the remaining N − N′ pixels, we simply use their mean value estimate conditioned on the initial and generated K + N′ samples (using Eq. 4). While this may lead to some loss in detail, we qualitatively show that the effects are not prohibitive and that the sample diversity is preserved.
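A sketch of this hybrid sampling procedure is shown below: a fresh random pixel order, autoregressive sampling of the first N′ locations (Eq. 5-6), and a single batched mean-value query (Eq. 4) for the rest. It assumes the hypothetical model from the earlier architecture sketch along with vectorized sample_value / expected_value helpers in the spirit of the distribution sketch; it illustrates the control flow rather than an optimized implementation.

```python
import torch

@torch.no_grad()
def sample_image(model, sample_value, expected_value, s_pos, s_val,
                 grid, n_autoregressive=2048):
    """Draw one image I ~ p(I | S_0) with the hybrid strategy of Section 4.3.

    grid: (N, 2) tensor of all pixel locations g_n.
    s_pos, s_val: the K initially observed (position, value) pairs S_0.
    """
    order = torch.randperm(grid.shape[0])            # fresh random order per sample
    cond_pos, cond_val = [s_pos], [s_val]            # growing conditioning set
    out_val = torch.zeros(grid.shape[0], s_val.shape[-1])

    # Phase 1: autoregressive sampling of the first N' pixels (Eq. 5 and Eq. 6).
    for n in order[:n_autoregressive].tolist():
        pos, val = torch.cat(cond_pos), torch.cat(cond_val)
        omega = model(grid[n].view(1, 1, -1), pos[None], val[None])[0, 0]
        v = sample_value(omega)                      # v'_n ~ p(v; omega_n)
        out_val[n] = v
        cond_pos.append(grid[n].view(1, -1))
        cond_val.append(v.view(1, -1))

    # Phase 2: mean-value estimates for the remaining pixels (Eq. 4),
    # conditioned on the K + N' accumulated samples.
    rest = order[n_autoregressive:]
    if len(rest) > 0:
        pos, val = torch.cat(cond_pos), torch.cat(cond_val)
        omega = model(grid[rest][None], pos[None], val[None])[0]
        out_val[rest] = expected_value(omega)
    return out_val                                   # (N, d); reshape to (H, W, d)
```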
5. Experiments

To qualitatively and quantitatively demonstrate the efficacy of our approach, we consider the task of generating images given a set of pixels with known values. The goal of our experiments is twofold – a) to validate that our predictions account for the observed pixels, and b) to show that the generated samples are diverse and plausible.

Datasets. We examine our approach on three different image datasets – CIFAR10 (Krizhevsky, 2009), MNIST (LeCun et al., 1998), and the Cat Faces (Wu et al., 2020) dataset, while using the standard image splits. Note that we only require the images for training – class or attribute labels are not leveraged for learning our models i.e. even on CIFAR10, we learn a class-agnostic generative model.

Training Setup. We vary the number of observed pixels S randomly between 4 and 2048 (with uniform sampling in log-scale), while the number of query samples Q is set to 2048. During training, the locations x are treated as varying over a continuous domain, using bilinear sampling to obtain the corresponding value – this helps our implementation be agnostic to the image resolution in the dataset. While we train a separate network f_θ for each dataset, we use the exact same model, hyper-parameters etc. across them.
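Under this setup, one training step of Eq. 3 amounts to: draw a conditioning set S and a query set Q from an image with continuous positions and bilinearly interpolated values, run all queries at once, and minimize the negative log-likelihood. The sketch below illustrates this; the model is the hypothetical PixelTransformerSketch from Section 4 and nll(omega, values) is an assumed helper evaluating the bin-plus-gaussian likelihood, neither of which is the released code.

```python
import math
import random
import numpy as np
import torch

def bilinear_sample(image, pos):
    """Read values at continuous positions pos (N, 2) in [0, 1]^2 from an
    (H, W, C) image with bilinear interpolation (resolution-agnostic lookup)."""
    H, W, _ = image.shape
    y, x = pos[:, 0] * (H - 1), pos[:, 1] * (W - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (y - y0)[:, None], (x - x0)[:, None]
    return ((1 - wy) * (1 - wx) * image[y0, x0] + (1 - wy) * wx * image[y0, x1]
            + wy * (1 - wx) * image[y1, x0] + wy * wx * image[y1, x1])

def draw_set(image, n):
    """Sample n random continuous positions and their (bilinearly read) values."""
    pos = np.random.rand(n, 2)
    val = bilinear_sample(image, pos)
    return torch.from_numpy(pos).float(), torch.from_numpy(val).float()

def training_step(model, optimizer, image, nll, n_query=2048):
    """One self-supervised step of Eq. 3 for a single (H, W, C) training image."""
    n_obs = int(math.exp(random.uniform(math.log(4), math.log(2048))))  # |S|, log-uniform
    s_pos, s_val = draw_set(image, n_obs)       # conditioning set S
    q_pos, q_val = draw_set(image, n_query)     # query set Q with true values
    omega = model(q_pos[None], s_pos[None], s_val[None])   # (1, |Q|, B*(d+2))
    loss = nll(omega[0], q_val)                 # mean of -log p(v_x*; omega) over Q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```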

Figure 3. Inferred Mean Images. We visualize the mean image predicted by our learned model on random instances of the Cat Faces dataset. Top row: ground-truth image. Rows 2-8: Predictions using increasing number of observed pixels |S|.

Qualitative Results: Mean Image Prediction. We first examine the expected image Ī inferred by our model given some samples S. We visualize in Figure 3 our predictions on the Cat Faces dataset using a varying number of input samples. We observe that even when using as few as 4 pixels in S, our model predicts a cat-like mean image that, with some exceptions, captures the coarse color accurately. A very small number of pixels, however, is not sufficiently informative of the pose/shape of the head, which become more accurate given around 100 samples. As expected, the mean image becomes closer to the true image given additional samples, with the later ones even matching finer details e.g. eye color, indicating that the distribution p(I|S) reduces in variance as |S| increases.

Figure 4. Image Samples. Sample images generated by our learned model on three datasets (left: MNIST, middle: Cat Faces, right: CIFAR10) given |S| = 32 observed pixels. Top row: ground-truth image from which S is drawn. Row 2: A nearest neighbor visualization of S – for each image pixel we assign it the color of the closest observed sample in S. Rows 3-5: Randomly sampled images from p(I|S).

Qualitative Results: Sampling Images. While examining the mean image assures us that our average prediction is meaningful, it does not inform us about samples drawn from p(I|S). In Figure 4, we show results on images from each of the three datasets considered using |S|=32 randomly observed pixel values in each case. We see that the sampled images vary meaningfully (e.g. face textures) while preserving the coarse structure, though we do observe some artefacts e.g. missing horse legs.

Figure 5. Image Composition. Generation results when drawing pixels from two different images. Top row: the composed image from which S is drawn. Row 2: A nearest neighbor visualization of S. Row 3: Randomly sampled image from p(I|S).

As an additional application, we can generate images by mixing pixel samples from different images. We showcase some results in Figure 5 where we show one generated image given some pixels from the top/bottom of two different images. We see that, despite some mismatch in the alignment/texture of the underlying faces, our model is able to compose them to generate a plausible new image.

Figure 6. Reconstruction Accuracy of generated images on CIFAR10 (SSIM vs. |S|; a decoder + optimization baseline compared with our mean and sampled images).

Figure 7. Classification Accuracy of generated images on CIFAR10 (accuracy vs. |S|; ground-truth images compared with our mean and sampled images).

Reconstruction and Classification Accuracy. In addition to visually inspecting the mean and sampled images, we also quantitatively evaluate them using reconstruction and classification based metrics on the CIFAR10 dataset. First, we measure how similar our obtained images are to the underlying ground-truth image. Figure 6 plots this accuracy for varying size of S – we compute this plot using 128 test images, varying |S| from 4 to 2048 for each. When reporting the accuracy for sampled images, we draw 3 samples per instance and use the average performance. We also report a baseline that uses a pretrained decoder (from a VAE) and optimizes the latent variable to best match the pixels in S (see appendix for details). We observe that our predicted images, more so than the baseline, match the true image. Additionally, the mean image is slightly more 'accurate' in terms of reconstruction than the sampled ones – perhaps because the diversity of samples makes them more different.

We also plot the classification accuracy of the generated images in Figure 7. To do so, we use a pretrained ResNet-18 (He et al., 2016) based classifier and measure whether the correct class label is inferred from our generated images. Interestingly, we see that even if using images generated from as few as 16 pixels, we obtain about a 30% classification accuracy (or over 60% with 128 pixels). As we observe more pixels, the accuracy matches that of using the ground-truth images. Finally, we see that using the sampled images yields better results compared to the mean image, as the sampled ones look more 'real'.
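The evaluation protocol above can be restated procedurally: for each test image and each size of S, form the conditioning set, generate images, and score them against the ground truth and with a pretrained classifier. A schematic sketch is given below, with the generator, classifier, and similarity metric passed in as assumed callables since the exact metric code used here is not specified.

```python
def evaluate(test_images, test_labels, generate, classify, similarity,
             sizes=(4, 16, 64, 256, 1024, 2048), n_samples=3):
    """Reconstruction and classification accuracy vs. |S| (cf. Figures 6-7).

    generate(image, n_obs) -> list of generated images conditioned on n_obs pixels
    classify(image)        -> predicted class label
    similarity(a, b)       -> image similarity score (e.g. SSIM)
    """
    results = {}
    for n_obs in sizes:
        sims, correct, total = 0.0, 0, 0
        for image, label in zip(test_images, test_labels):   # e.g. 128 test images
            for gen in generate(image, n_obs)[:n_samples]:    # 3 samples per instance
                sims += similarity(gen, image)
                correct += int(classify(gen) == label)
                total += 1
        results[n_obs] = (sims / total, correct / total)
    return results
```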
6. Beyond Images: 1D and 3D Signals

While we leveraged our proposed framework for generating images given some pixel observations, our formulation is applicable beyond images. In particular, assuming the availability of (unlabeled) examples, our approach can learn to generate any dense spatial signal given some (position, value) samples. In this section, we empirically demonstrate this by learning to generate 1D (polynomial) and 3D (shapes and videos) signals using our framework.

We would like to emphasize that across these settings, where we are learning to generate rather different spatial signals, we use the same training objective and model design. That is, except for the dimensionality of the input/output layers and the distribution parametrization to handle the corresponding inputs/outputs x and v, our model or learning objective is not modified in any way specific to the domain.
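As a small illustration of this point, the only per-domain quantities that change are the position dimension, the value dimension, and the number of output bins B; the polynomial setting below uses B = 1 as stated in Section 6.1, while the bin counts for shapes and videos are assumptions for the sketch rather than values reported in the paper.

```python
# Per-domain configuration: the only knobs that change across signal types.
DOMAIN_CONFIGS = {
    "rgb_image":  dict(pos_dim=2, val_dim=1 * 3, n_bins=256),
    "polynomial": dict(pos_dim=1, val_dim=1, n_bins=1),      # plain gaussian output (Sec. 6.1)
    "sdf_shape":  dict(pos_dim=3, val_dim=1, n_bins=256),    # n_bins assumed
    "video":      dict(pos_dim=3, val_dim=3, n_bins=256),    # (y, x, t) positions; n_bins assumed
}
```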
6.1. Polynomial Prediction

As an illustrative example to study our method, we consider a classical task – given a sparse set of (x, g(x)) pairs, where x, g(x) ∈ R^1, we want to predict the value of g over its domain. We randomly generate degree-6 polynomials, draw from 4 to 20 samples to obtain S, and learn f_θ to predict the distribution of values at |Q|=20 query locations. One simplification compared to the model used for images is that we use B = 1 instead of B = 256 (i.e. a simple gaussian distribution) to parametrize the output distribution.

Figure 9. Polynomial Prediction. Mean and sampled polynomials generated by our learned model. Row 1: Predictions using |S| = 4 samples (red dots). Row 2: Predictions using |S| = 6.

We visualize our predictions in Figure 9, where the columns correspond to different polynomials, and the rows depict our results with varying number of inputs in S. We see that the various sample functions we predict are diverse and meaningful, while being constrained by the observed (position, value) pairs. Additionally, as the number of observations in S increases, the variance of the function distribution reduces and matches the true signal more closely.

6.2. Generating 3D Shapes

We next address the task of generating 3D shapes represented as signed distance fields (SDFs). We consider the category of chairs using models from 3D Warehouse (3DW), leveraging the subset recommended by Chang et al. (Chang et al., 2015). We use the train/test splits provided by (Xu et al., 2019), with 5268 shapes used for training, and 1311 for testing. We extract a SDF representation for each shape as a grid of size 64^3, with each location recording a continuous signed distance value – this dense representation is better suited for our approach compared to sparse occupancies. Our training procedure is exactly the same as the one used for 2D images – we sample the SDF grid at random locations to generate S, Q, with the number of samples in S varying from 4 to 2048, and |Q| being 2048.

Figure 8. Shape Generation. Sample 3D shapes generated given |S| = 32 observed SDF values at random locations. Top row: ground-truth 3D shape. Row 2: A visualization of S – a sphere is centred at each position with color indicating value (red implies higher SDF). Rows 3-5: Randomly sampled 3D shapes from our predicted conditional distribution.

We present some randomly chosen 3D shapes generated by our model when using |S| = 32 in Figure 8. While we actually generate a per-location signed distance value, we extract a mesh using marching cubes to visualize this prediction. As the results indicate, even when using only 32 samples from such a high-dimensional 3D spatial signal, our model is able to generate diverse and plausible 3D shapes. In particular, even though this is not explicitly enforced, our model generates symmetric shapes and the variations are semantically meaningful as well as globally coherent e.g. slope of the chair back, handles with or without holes. However, as our model generates the SDF representation, and does not directly produce a mesh, we often see some artefacts in the resulting mesh e.g. disconnected components, which can occur when thresholding a slightly inconsistent SDF.

6.3. Synthesizing Videos

Lastly, we examine the domain of 'higher-dimensional' images (e.g. videos). In particular, we use the subset of 'beach' videos in the TinyVideos dataset (Vondrick et al., 2016; Thomee et al., 2016) (with a random 80%-20% train-test split) and train our model to generate video clips with 34 frames. Note that these naturally correspond to 3D spatial signals, as the position x includes a timeframe ∈ R^1 in addition to a pixel coordinate.

We train our model f_θ to generate the underlying signal distribution given sparse pixel samples, where we randomly choose a frame and pixel coordinate for each sample. We empirically observe that due to the high complexity of the output space, using only a small number of samples does not provide significant information for learning generation. We therefore train our model using more samples than for the image generation task – varying |S| between 512 and 2048 (this corresponds to 30 pixels per frame).

Figure 10. Video Synthesis. Sample videos generated by our model given |S|=1024 observed pixels across 34 frames. Top row: 4 uniformly sampled frames of the ground-truth video. Row 2: A nearest neighbor visualization of S. Rows 3-5: Randomly sampled videos from the predicted conditional distribution.

We present representative results in Figure 10 but also encourage the reader to see the videos on the project page. Our model generates plausible videos with some variation e.g. flow of waves, and captures the coarse structure of the output well. However, the predictions lack precise detail. We attribute this to the limited number of pixels we can generate autoregressively (see discussion in Section 4.3 on memory bottlenecks) and hypothesize that a higher number may be needed for modeling these richer signals.

7. Discussion

We proposed a probabilistic generative model capable of generating images conditioned on a set of random observed pixels, or more generally, synthesizing spatial signals given sparse samples. At the core of our approach is a learned function that predicts value distributions at any query location given an arbitrary set of observed samples. While we obtain encouraging results across some domains, there are several aspects which could be improved e.g. scalability, perceptual quality, and handling sparse signals. To allow better scaling, it could be possible to generalize the outputs from distributions over individual pixels to those over a vocabulary of tokens encoding local patches, or to investigate strategies to better select conditioning subsets (e.g. nearest samples). The perceptual quality of our results could be further improved, and incorporating adversarial objectives may be a promising direction. Finally, while our framework allowed generating pixel values, we envision that a similar approach could predict other dense properties of interest e.g. semantic labels, depth, generic features.

Acknowledgements. We would like to thank Deepak Pathak and the members of the CMU Visual Robot Learning lab for helpful discussions and feedback.
References

3D Warehouse. https://3dwarehouse.sketchup.com/.

Bojanowski, P., Joulin, A., Lopez-Paz, D., and Szlam, A. Optimizing the latent space of generative networks. In ICML, 2018.

Brochu, E., Cora, V. M., and De Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Dhariwal, P., Luan, D., and Sutskever, I. Generative pretraining from pixels. In ICML, 2020.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Jain, A., Abbeel, P., and Pathak, D. Locally masked convolution for autoregressive models. In UAI, 2020.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, K. and Malik, J. Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087, 2018.

Martin-Brualla, R., Radwan, N., Sajjadi, M. S. M., Barron, J. T., Dosovitskiy, A., and Duckworth, D. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, 2021.

Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., and Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

Murphy, K. P. Machine learning: a probabilistic perspective. 2012.

Osher, S., Shi, Z., and Zhu, W. Low dimensional manifold model for image processing. SIAM J. Imaging Sci., 2017.

Park, J. J., Florence, P., Straub, J., Newcombe, R., and Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, 2019.

Parmar, N. J., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., and Ku, A. Image transformer. 2018. URL https://arxiv.org/abs/1802.05751.

Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning. Springer, 2003.

Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In NeurIPS, 2019.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. ICLR, 2017.

Sitzmann, V., Thies, J., Heide, F., Nießner, M., Wetzstein, G., and Zollhofer, M. Deepvoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.

Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. NeurIPS, 2020.

Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.

Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In CVPR, 2018.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In NeurIPS, 2016b.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016c.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In NeurIPS, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In NeurIPS, 2016.

Wu, S., Rupprecht, C., and Vedaldi, A. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In CVPR, 2020.

Xu, Q., Wang, W., Ceylan, D., Mech, R., and Neumann, U. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, 2019.

Zhang, K., Riegler, G., Snavely, N., and Koltun, V. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.

Appendix

Log-likelihood under Value Distribution. The predicted value distribution for a query position x is of the form p(v; ω), where ω ≡ {(q^b, μ^b, σ^b)}_{b=1..B}. We reiterate that q^b ∈ R^1 is the probability of assignment to bin b, and c^b + μ^b is the mean of the corresponding gaussian distribution with uniform variance σ^b ∈ R^1.

Under this parametrization, we compute the log-likelihood of a value v* by finding the closest bin b*, and computing the log-likelihood of assignment to this bin as well as the log-probability of the value under the corresponding gaussian. We additionally use a weight α = 0.1 to balance the classification and gaussian log-likelihood terms.

    b* = argmin_b ‖v* − c^b‖
    log p(v*; ω) ≡ log q^{b*} − α (log σ^{b*} + ((v* − c^{b*} − μ^{b*}) / σ^{b*})^2)

VAE Training and Inference. We train a variational autoencoder (Kingma & Welling, 2013) on the CIFAR10 dataset with a bottleneck layer of dimension 4 × 4 × 64 i.e. spatial size 4 and feature size 64. We consequently obtain a decoder D which we use for inference given some observed samples S. Specifically, we optimize for an optimal latent variable that minimizes the reconstruction loss for the observed samples (with an additional prior biasing towards the zero vector). Denoting by I(x) the value of image I (bilinearly sampled) at position x, the image I* inferred using a decoder D by optimizing over S can be computed as:

    z* = argmin_z L(D(z), S) + 0.001 · ‖z‖^2;   I* = D(z*)

    L(I, {(x_k, v_k)}) = E_k ‖I(x_k) − v_k‖_1
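For reference, a compact sketch of this latent-variable optimization for the VAE baseline is given below. The decoder, its latent shape, and the use of Adam are assumptions for illustration; only the objective (L1 reconstruction on the observed samples plus a 0.001-weighted squared-norm prior on z) follows the description above.

```python
import torch

def infer_image_from_samples(decoder, s_pos, s_val, latent_shape=(1, 64, 4, 4),
                             steps=500, lr=0.05):
    """Optimize z so that the decoded image matches the observed samples S."""
    z = torch.zeros(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = decoder(z)                                    # (1, C, H, W), assumed in [0, 1]
        # Bilinearly read the decoded image at the observed (y, x) positions in [0, 1].
        grid = (s_pos.flip(-1) * 2 - 1).view(1, 1, -1, 2)   # grid_sample expects (x, y) in [-1, 1]
        pred = torch.nn.functional.grid_sample(img, grid, align_corners=True)
        pred = pred.squeeze(0).squeeze(1).T                 # (K, C)
        loss = (pred - s_val).abs().mean() + 0.001 * (z ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder(z).detach()                              # I* = D(z*)
```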
